ABSTRACT

This paper is aimed at developing a better understanding of the structure of the 
information that is contained in galaxy surveys, so as to find optimal ways to combine
observables from such surveys. We first show how Jaynes' Maximal Entropy Principle 
allows us, in the general case, to express the Fisher information content of data sets 
in terms of the curvature of the Shannon entropy surface with respect to the relevant 
observables. This allows us to understand the Fisher information content of a data set, 
once a physical model is specified, independently of the specific way that the data will 
be processed, and without any assumptions of Gaussianity. This includes as a special 
case the standard Fisher matrix prescriptions for Gaussian variables widely used in the 
cosmological community, for instance for power spectra extraction. As an application 
of this approach, we evaluate the prospects of a joint analysis of weak lensing tracers 
up to second order in the shape distortions, in the case that the noise in each probe
can be effectively treated as model independent. These include the magnification, the
two ellipticity and the four flexion fields. At the two-point level, we show that the
only effect of treating these observables in combination is a simple scale dependent 
decrease of the noise contaminating the accessible spectrum of the lensing E-mode. 
We provide simple bounds to its extraction by a combination of such probes, as well 
as its quantitative evaluation when the correlations between the noise variables for 
any two such probes can be neglected. 

Key words: cosmology: cosmological parameters, cosmology: large-scale structure of 
the Universe, methods: data analysis, methods: statistical 



1 INTRODUCTION 

With cosmological data sets currently going through a rapid 
period of growth, it is increasingly important to quantita- 
tively understand the potential and limits of particular data 
sets to test a physical model or hypothesis. For this, the 
Fisher information matrix has become a widely used tool in 
cosmology. 

The concept of Fisher information has a long history.
It was first introduced by the statistician and geneticist R.A.
Fisher (Fisher 1925) under the name of intrinsic accuracy
of frequency curves. It has found its way into the cosmological
community over the last decade, where it is often used to
optimise the survey configurations of planned cosmology
experiments (Amara & Refregier 2007; Parkinson et al. 2007;
Albrecht et al. 2006; Bernstein 2003) or to evaluate
the expected errors on certain cosmological parameters
with some observables (Tegmark et al. 1997; Tegmark 1997;
Hu & Tegmark 1999; Hu & Jain 2004).

Much of the work to date has been limited to par- 
ticular sets of observables and estimators. Usually, it is 
assumed that observational errors as well as the parameters 
probability distribution have Gaussian shape. The first aim 
of this work is to propose a framework to express the global 
Fisher information content of large data sets in a way that 
is independent of the specific ways that the data will be 
processed, and in realistic situations, where the exact sta- 
tistical properties of the data are not known precisely. This 
should then provide a well motivated basis point in order 
to perform systematic and robust trade-off studies. For this 
purpose a number of useful concepts already exist in the 
fields of information theory and probability theory, such
as Shannon entropy or relative entropy (Kullback 1959),
which we can use to gain a better understanding of what we
can achieve with planned experiments. Specifically, we will
show that we can achieve our aim by combining Fisher's
information measure with Jaynes' Principle of Maximal
Entropy (Jaynes 1983; Jaynes & Bretthorst 2003).

In a second step, as a concrete application of this
approach, we investigate the joint entropy and information
content of multiple observables of the same
underlying, cosmologically interesting field. This is a
very relevant situation in weak lensing (Schneider et al. 1992;
Bartelmann & Schneider 2001; Refregier 2003; Munshi et al. 2006;
Schneider et al. 2006), where the
distortions of galaxy images to any order are sourced by
the lensing potential field.

This paper is divided into the following sections.
In section 2 we present our approach in detail. We
first review and develop some key properties of Fisher
information, and its link to the Cramér-Rao inequality. We
put a strong emphasis on its interpretation as a measure
of information on model parameters in a data set, which
is obtained from the probability distribution of different
observational outcomes as a function of these same model
parameters. Readers familiar with these aspects may jump
to section 2.3, where we introduce Jaynes' Maximal
Entropy Principle, and show how it ideally completes
Fisher's information measure, allowing us to understand
the information content of a data set on a physical model in
the case of incomplete knowledge. In section 3 we show how
the study of the Shannon entropy of a set of homogeneous
fields provides a simple and model parameter independent
answer to the question of the combination of the weak
lensing observables shear, magnification and flexion. We
provide in section 4 a quantitative evaluation of the prospects
of such a combination at the two-point level for typical
dark energy survey parameters, and conclude in section 5
with a summary of the results and a discussion. A set of
appendices collects some technical details.



2 FISHER INFORMATION AND JAYNES 
MAXENT PRINCIPLE 

The concept of Fisher information is rich and not limited
to parameter error estimation. We review here a few simple
points of interest that justify the interpretation of the
Fisher matrix as a measure of the information content of an
experiment. Let us begin by considering the case of a single
measurement $X$, with different possible outcomes, or
realisations, $x$, and a model with a single parameter $\alpha$. We also
assume that we have knowledge, prior to the given experiment,
of the probability density function $p_X(x,\alpha)$, which
depends on our parameter $\alpha$ and gives the probability of
observing particular realisations for each value of the model
parameter. The Fisher information, $F^X(\alpha)$, in $X$ on $\alpha$, is a
non-negative scalar in this one parameter case. It is defined in a
fully general way as a sum over all realisations of the data
(Fisher 1925):

$$ F^X(\alpha) = \left\langle \left( \frac{\partial \ln p_X(x,\alpha)}{\partial \alpha} \right)^2 \right\rangle. \tag{1} $$



Angle brackets will always stand for the mean value with respect
to the probability density function, i.e. for any function $f$,

$$ \langle f \rangle = \int dx\, p_X(x,\alpha)\, f(x). \tag{2} $$

Three simple but important properties of Fisher information
are worth highlighting at this point.

The first is that $F^X(\alpha)$ is non-negative, and it
vanishes if and only if the parameter $\alpha$ does not impact the
data, i.e. if the derivative of $p_X(x,\alpha)$ with respect to $\alpha$ is
zero for every realisation $x$.

The second point is that it is invariant under invertible
manipulations of the observed data. This can be seen by
considering an invertible change of variable $y = f(x)$, which,
due to the rules of probability theory, can be expressed as

$$ p_Y(y,\alpha) = p_X(x,\alpha) \left| \frac{dx}{dy} \right|. \tag{3} $$

Since the Jacobian does not depend on $\alpha$,

$$ \frac{\partial \ln p_Y(y,\alpha)}{\partial \alpha} = \frac{\partial \ln p_X(x,\alpha)}{\partial \alpha}, \tag{4} $$

leading to the simple equivalence $F^Y(\alpha) = F^X(\alpha)$.
On the other hand, information may be lost when the
transformation is not unique in both directions (see e.g.
Rao 1973 for a proof). This happens, for instance, if the data are
combined to produce a new variable that could arise from
different sets of data points. This is simply the statement
that manipulations of the data lead, at best, to
conservation of the information.

The third point is that information from independent
experiments adds. Indeed, if two experiments with
data $X$ and $Y$ are independent, then the joint probability
density factorises,

$$ p_{XY}(x,y) = p_X(x)\, p_Y(y), \tag{5} $$

and it is easy to show that the joint information in the
observations decouples,

$$ F^{XY}(\alpha) = F^X(\alpha) + F^Y(\alpha). \tag{6} $$

These properties make the Fisher information a
meaningful measure of information, independently of
its interpretation as providing error bars on parameters. They
further imply that, once a physical model is specified with
a given set of parameters, a given experiment has a definite
information content that can only decrease with data
processing.
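
The three properties can be checked numerically on a toy case. The following is a minimal sketch, not part of the original analysis: it assumes Gaussian data with unknown mean $\alpha$ and known widths, evaluates equation (1) by simple quadrature, and all names (`fisher_1d`, `fisher_joint`, the linear map $u = 2x + 3$) are illustrative choices.

```python
import math

def log_gauss(x, mean, sigma):
    """Log-density of a normal distribution with given mean and standard deviation."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def score(log_pdf, x, alpha, eps=1e-4):
    """d ln p / d alpha, by central finite difference in the parameter."""
    return (log_pdf(x, alpha + eps) - log_pdf(x, alpha - eps)) / (2.0 * eps)

def fisher_1d(log_pdf, alpha, lo, hi, n=4000):
    """Equation (1): midpoint-rule integral of p(x) * score(x)^2 over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        total += math.exp(log_pdf(x, alpha)) * score(log_pdf, x, alpha) ** 2 * dx
    return total

def fisher_joint(log_p1, box1, log_p2, box2, alpha, n=200):
    """Equation (1) for two independent data sets with joint density p1(x) * p2(y)."""
    d1 = (box1[1] - box1[0]) / n
    d2 = (box2[1] - box2[0]) / n
    total = 0.0
    for i in range(n):
        x = box1[0] + (i + 0.5) * d1
        lp1, s1 = log_p1(x, alpha), score(log_p1, x, alpha)
        for j in range(n):
            y = box2[0] + (j + 0.5) * d2
            lp2, s2 = log_p2(y, alpha), score(log_p2, y, alpha)
            total += math.exp(lp1 + lp2) * (s1 + s2) ** 2 * d1 * d2
    return total

# Assumed toy setup: X ~ N(alpha, SIGMA_X^2) and Y ~ N(alpha, SIGMA_Y^2), independent.
ALPHA, SIGMA_X, SIGMA_Y = 1.0, 2.0, 1.0
BOX_X = (ALPHA - 10.0 * SIGMA_X, ALPHA + 10.0 * SIGMA_X)
BOX_Y = (ALPHA - 10.0 * SIGMA_Y, ALPHA + 10.0 * SIGMA_Y)
BOX_U = (2.0 * BOX_X[0] + 3.0, 2.0 * BOX_X[1] + 3.0)

def log_px(x, a):
    return log_gauss(x, a, SIGMA_X)

def log_py(y, a):
    return log_gauss(y, a, SIGMA_Y)

def log_pu(u, a):
    # Density of u = 2x + 3, equation (3): p_U(u) = p_X((u - 3) / 2) * |dx/du|.
    return log_gauss((u - 3.0) / 2.0, a, SIGMA_X) - math.log(2.0)

f_x = fisher_1d(log_px, ALPHA, *BOX_X)      # analytic value: 1 / SIGMA_X^2
f_u = fisher_1d(log_pu, ALPHA, *BOX_U)      # invariance: should match f_x
f_y = fisher_1d(log_py, ALPHA, *BOX_Y)      # analytic value: 1 / SIGMA_Y^2
f_xy = fisher_joint(log_px, BOX_X, log_py, BOX_Y, ALPHA)  # additivity, equation (6)

print(f_x, f_u, f_y, f_xy)
```

In this sketch the invertible map leaves the information unchanged because the Jacobian term in `log_pu` does not depend on the parameter, and the joint value agrees with the sum of the two individual ones.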



2.1 The case of a single observable 

To quantify the last point above, and in order to get an
understanding of the structure of the information in a data
set, we first review a simple situation, common in cosmology,
where the extraction of the model parameter $\alpha$ from
the data goes through the intermediate step of estimating a
particular observable, $D$, from the data, $x$, with the help of
which $\alpha$ will be inferred. A typical example could be, from
the temperature map of the CMB ($x$), the measurement of
the power spectrum of the fluctuations ($D$), from which a
cosmological parameter ($\alpha$) is extracted. The observable $D$ is
measured from $x$ with the help of an estimator, which we call
$\hat{D}$, and which we will take as unbiased. This means that its
mean value, as would be obtained for instance if many
realizations of the data were available, converges to the actual
value that we want to compare with the model prediction,

$$ \langle \hat{D} \rangle = D(\alpha). \tag{7} $$

A measure for its deviations from sample to sample, or the
uncertainty in the actual measurement, is then given by the
variance of $\hat{D}$, defined as

$$ \mathrm{Var}(\hat{D}) = \langle \hat{D}^2 \rangle - \langle \hat{D} \rangle^2. \tag{8} $$

In such a situation, a major role is played by the so-called
Cramér-Rao inequality (Rao 1973), which links the Fisher
information content of the data to the variance of the
estimator, stating that

$$ \mathrm{Var}(\hat{D})\, F^X(\alpha) \geq \left( \frac{\partial D}{\partial \alpha} \right)^2. \tag{9} $$

This equation holds for any such estimator $\hat{D}$ and any
model parameter $\alpha$. Two different interpretations of this
equation are possible.

The first bounds the variance of $\hat{D}$ by the inverse of
the Fisher information. To see this, we consider the special
case of the model parameter $\alpha$ being $D$ itself. Although we
are in general making a conceptual distinction between the
observable $D$ and the model parameter $\alpha$, nothing requires
us to do so. Since $\alpha$ is now equal to $D$, the derivative
on the right hand side becomes unity, and one obtains

$$ \mathrm{Var}(\hat{D}) \geq \frac{1}{F^X(D)}. \tag{10} $$

The variance of any unbiased estimator $\hat{D}$ of $D$ is therefore
bounded by the inverse of the amount of information
$F^X(D)$ the data possess on $D$. If $F^X(D)$ is known, it gives
a useful lower limit on the error bars that the analysis of
the data can put on this observable.

The second reading of the Cramér-Rao inequality,
closer in spirit to the present work, is to look at how
information is lost by constructing the observable $D$ and
discarding the rest of the data set. For this, we trivially
rewrite equation (9) as

$$ F^X(\alpha) \geq \left( \frac{\partial D}{\partial \alpha} \right)^2 \frac{1}{\mathrm{Var}(\hat{D})}. \tag{11} $$

The expression on the right hand side is the ratio of the
sensitivity of the observable to the model parameter, $(\partial D / \partial \alpha)^2$,
to the accuracy with which the observable can be extracted
from the data, $\mathrm{Var}(\hat{D})$. One of the conceivable approaches
in order to estimate the true value of the parameter $\alpha$ is to
perform a $\chi^2$ fit to the measured value of $D$. It is simple
to show that this ratio, evaluated at the best fit value, is in
fact proportional to the expected value of the curvature of
$\chi^2(\alpha)$ at this value. Since the curvature of the $\chi^2$ surface
describes how fast the value of the $\chi^2$ increases when
moving away from the best fit value, its inverse can be
interpreted as an approximation to the error bar that the
analysis with the help of $D$ will put on $\alpha$.
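
The proportionality to the $\chi^2$ curvature can be made explicit in a few lines. This is only a sketch under two assumptions that the text does not state: the $\chi^2$ has the usual single-observable form, and $\mathrm{Var}(\hat{D})$ does not depend on $\alpha$.

```latex
\begin{align}
\chi^2(\alpha) &= \frac{\left[ \hat{D} - D(\alpha) \right]^2}{\mathrm{Var}(\hat{D})}, \\
\frac{\partial^2 \chi^2}{\partial \alpha^2}
  &= \frac{2}{\mathrm{Var}(\hat{D})}
     \left[ \left( \frac{\partial D}{\partial \alpha} \right)^2
          - \left( \hat{D} - D(\alpha) \right) \frac{\partial^2 D}{\partial \alpha^2} \right], \\
% The second term averages to zero because the estimator is unbiased, equation (7).
\left\langle \frac{1}{2} \frac{\partial^2 \chi^2}{\partial \alpha^2} \right\rangle
  &= \left( \frac{\partial D}{\partial \alpha} \right)^2 \frac{1}{\mathrm{Var}(\hat{D})}.
\end{align}
```

Under these assumptions the right hand side of equation (11) is half the expected curvature of the $\chi^2$ surface.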



Thus, equation (11) shows that by only considering $D$
and not the full data set, we may have lost information
on $\alpha$, a loss given by the difference between the left and
right hand sides of that equation. While the latter may be
interpreted as the information on $\alpha$ contained in the part of
the data represented by $D$, we may have lost trace of any
other source of information.
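
The loss of information through the choice of estimator can be illustrated with a small Monte Carlo experiment. This is a hedged sketch rather than anything taken from the paper: it assumes the data are independent Gaussian draws with unknown mean $\alpha$, takes the observable $D$ to be that mean (so $\partial D / \partial \alpha = 1$), and compares two unbiased estimators of it, the sample mean and the sample median. The seed, sample sizes and function name are arbitrary.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility

ALPHA, SIGMA, N_POINTS, N_TRIALS = 0.3, 1.0, 25, 20000

# Fisher information on alpha in N_POINTS independent draws from N(alpha, SIGMA^2).
# Each draw carries 1 / SIGMA^2, and independent information adds (equation (6)).
fisher_total = N_POINTS / SIGMA ** 2

means, medians = [], []
for _ in range(N_TRIALS):
    sample = [random.gauss(ALPHA, SIGMA) for _ in range(N_POINTS)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

def retained_information(estimates, dD_dalpha=1.0):
    """Right hand side of equation (11): (dD/dalpha)^2 / Var(D_hat)."""
    return dD_dalpha ** 2 / statistics.pvariance(estimates)

info_mean = retained_information(means)      # close to fisher_total, up to Monte Carlo noise
info_median = retained_information(medians)  # noticeably below fisher_total

print(fisher_total, info_mean, info_median)
```

In this toy case the sample mean retains essentially all of the information, while the median discards part of it, which is the situation described by the difference between the two sides of equation (11).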



2.2 The general case 

These considerations on the Cramér-Rao bound can be easily
generalised to the case of many parameters and many
estimators of as many observables. Still dealing with a
measurement $X$ with outcomes $x$, we want to estimate a set of
parameters

$$ \theta = (\alpha, \beta, \cdots) \tag{12} $$

with the help of some vector of observables,

$$ \mathbf{D} = (D_1, \cdots, D_n), \tag{13} $$

that are extracted from $x$ with the help of an array of
unbiased estimators,

$$ \hat{\mathbf{D}} = (\hat{D}_1, \cdots, \hat{D}_n), \qquad \langle \hat{\mathbf{D}} \rangle = \mathbf{D}. \tag{14} $$

In this multidimensional setting, all three scalar
quantities that played a role in our discussion in section
2.1, i.e. the variance of the estimator, the derivative of the
observable with respect to the parameter, and the Fisher
information, are now matrices.

The Fisher information $F^X(\theta)$ in $X$ on the parameters $\theta$
is defined as the square matrix

$$ \left[ F^X(\theta) \right]_{\alpha\beta} = \left\langle \frac{\partial \ln p_X}{\partial \alpha} \frac{\partial \ln p_X}{\partial \beta} \right\rangle. \tag{15} $$

While the diagonal elements are identical to the information
scalars in equation (1), the off-diagonal ones describe
correlated information. The Fisher information matrix still
carries the three properties we discussed in section 2.
The variance of the estimator in equation (8) now becomes
the covariance matrix $\mathrm{cov}(\hat{\mathbf{D}})$ of the estimators $\hat{\mathbf{D}}$, defined
as

$$ \mathrm{cov}(\hat{\mathbf{D}})_{ij} = \langle \hat{D}_i \hat{D}_j \rangle - D_i D_j. \tag{16} $$



Finally, the derivative of the observable with respect to the
parameter, in the right hand side of equation (9), becomes a matrix
$\Delta$, in general rectangular, defined as

$$ \Delta_{\alpha i} = \frac{\partial D_i}{\partial \alpha}, \tag{17} $$

where $\alpha$ runs over all elements of the set of model
parameters. Again, the Cramér-Rao inequality provides a useful
link between these three matrices, and again there are two
approaches to that equation. The first, as usually presented in
the literature (Rao 1973), is in the form of a lower bound to
the covariance matrix of the estimators,

$$ \mathrm{cov}(\hat{\mathbf{D}}) \geq \Delta^T \left[ F^X(\theta) \right]^{-1} \Delta. \tag{18} $$

The inequality between two symmetric matrices, $A \geq B$,
means that the matrix $A - B$ is positive semi-definite.^1






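^1 A matrix $A$ is called positive semi-definite when $x^T A x \geq 0$
holds for any vector $x$. A concrete implication for our purposes
is e.g. that the diagonal entries of the left hand side of (18) or
(19), which are the individual variances of each estimator $\hat{D}_i$,
are greater than or equal to those of the right hand side. For many more
properties of positive definite matrices, see for instance Bhatia (2007).

If, as above, we consider the special case of identifying the
parameters with the observables themselves, the matrix $\Delta$
is the identity matrix, and so we obtain that the covariance
of the vector of the estimators is bounded by the inverse
of the amount of Fisher information that there is on the
observables in the data,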



$$ \mathrm{cov}(\hat{\mathbf{D}}) \geq \left[ F^X(\mathbf{D}) \right]^{-1}. \tag{19} $$

Second, we can turn this lower bound on the covariance into
a lower bound on the amount of information in the data
set as well. By rearranging equation (18), we obtain the
multidimensional analogue of equation (11), which describes
the loss of information that occurs when the data is reduced
to a set of estimators,

$$ F^X(\theta) \geq \Delta \left[ \mathrm{cov}(\hat{\mathbf{D}}) \right]^{-1} \Delta^T. \tag{20} $$

For the sake of completeness, a proof of these two inequalities
can be found in the appendices.

Instead of giving a useful lower bound to the covariance
of the estimators as in equation (18), in this form the
Cramér-Rao inequality makes clear how information is in
general lost when reducing the data to any particular set
of estimators. The right hand side may be seen, as before,
as the expected curvature of a $\chi^2$ fit to the estimates
produced by the estimators $\hat{\mathbf{D}}$, when evaluated at the best
fit value, with all correlations fully and consistently taken
into account.
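
A small worked case may help to read inequality (20). The sketch below assumes a purely hypothetical toy model, not one from the paper: three independent, unit-variance Gaussian data points with means $\alpha$, $\beta$ and $\alpha + \beta$, each data point serving as its own unbiased estimator, so that the estimator covariance is the identity. The helper names are illustrative.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def min_eigenvalue_2x2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix."""
    half_trace = 0.5 * (m[0][0] + m[1][1])
    radius = (0.25 * (m[0][0] - m[1][1]) ** 2 + m[0][1] ** 2) ** 0.5
    return half_trace - radius

# Hypothetical toy model: data points with means (alpha, beta, alpha + beta).
# Rows: parameters (alpha, beta). Columns: the three data points.
MEAN_DERIVS = [[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]]

# Full Fisher matrix for independent unit-variance Gaussian data:
# sum over data points of (d mean / d theta_a) * (d mean / d theta_b).
fisher = matmul(MEAN_DERIVS, transpose(MEAN_DERIVS))

def retained(columns):
    """Right hand side of equation (20) when only the listed data points are kept.
    Their covariance is the identity here, so it reduces to Delta Delta^T."""
    delta = [[row[c] for c in columns] for row in MEAN_DERIVS]
    return matmul(delta, transpose(delta))

def information_loss(columns):
    """Left hand side minus right hand side of equation (20)."""
    kept = retained(columns)
    return [[fisher[i][j] - kept[i][j] for j in range(2)] for i in range(2)]

loss_one = information_loss([2])         # keep only the point with mean alpha + beta
loss_two = information_loss([0, 2])      # keep two of the three points
loss_none = information_loss([0, 1, 2])  # keep everything

print(fisher, loss_one, loss_two, loss_none)
```

In each case the difference matrix has no negative eigenvalue, which is the matrix meaning of the inequality, and it vanishes only when all three data points are kept.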

In the next two sections, we show how Jaynes' Maximal
Entropy Principle allows us to understand the total information
content of a data set, once a model is specified, in very
similar terms.



2.3 Jaynes Maximal Entropy Principle 

In cosmology, the knowledge of the probability distribution
of the data as a function of the parameters, $p_X(x,\theta)$, which
is required in order to evaluate its Fisher information
content, is usually very limited. In a galaxy survey, a
data outcome $x$ would typically be the full set of angular
positions of the galaxies, together with some redshift
estimation if available, to which we may add any other
kind of information, such as luminosities, shapes, etc. Our
ignorance of both the initial conditions and of many relevant
physical processes does not allow us to predict either
galaxy positions in the sky, or all interconnections with
all this additional information. Our predictions of the
shape of $p_X$ are thus limited to some statistical properties
that are sensitive to the model parameters $\theta$, such as the
mean density over some large volume, or certain types of
correlation functions.






In fact, even if it were possible to devise some procedure
in order to get the exact form of $p_X$, it may eventually
turn out to be useless, or even undesirable, to do so. The
incredibly large number of degrees of freedom of such a
function is very likely to overwhelm the analyst with a mass
of irrelevant details, which may have no significance
on their own and may not improve the analysis in any meaningful way.

These arguments call for a kind of thermodynamical
approach, which would try to capture those aspects of
the data which are relevant to our purposes, reducing the
number of degrees of freedom in a drastic way. Such an
approach already exists in the field of probability theory
(Jaynes 1957). It is based on Shannon's concept of entropy
of a probability distribution (Shannon 1948) and sheds new
light on the connection between probability theory and
statistical mechanics.

As we have just argued, our predictive knowledge of
$p_X(x,\theta)$ is limited to some statistical properties. Let us
formalise this mathematically, in a similar way as in section
2.2. Astrophysical theory gives us a set of constraints on
the shape of $p_X$, in the form of averages of some functions
$o_i$,

$$ O_i(\theta) = \langle o_i(x) \rangle(\theta), \qquad i = 1, \cdots, n, \tag{21} $$



where $p_X$ enters through the angle brackets. As an
example, suppose the data outcome $x$ is a map of the
matter density field as a function of position. In this case,
one of these constraints $O_i$ could be the mean of the field
or its power spectrum, as given by some cosmological model.

The role of this array $\mathbf{O} = (O_1, \cdots, O_n)$ is to
represent faithfully the physical understanding we have of
$p_X$, according to the model, as a function of the model
parameters $\theta$. In the ideal case, some way can be devised to
extract each one of these quantities $O_i$ from the data and
to confront them with theory.

The set of observables $\mathbf{D}$ that we used in section 2.2
would be a subset of these predictions $\mathbf{O}$, and we henceforth
refer to $\mathbf{O}$ as the 'constraints'.



2.4 Maximal entropy distributions 

Although $p_X$ must satisfy the constraints (21), there may
still be a very large number of different distributions
compatible with these. However, a very special status among
these distributions is held by the one which maximises the value
of Shannon's entropy^2, defined as

$$ S = -\int dx\, p_X(x,\theta) \ln p_X(x,\theta). \tag{22} $$



^2 Formally, for continuous distributions the reference to another
distribution is needed to render $S$ invariant with respect to
invertible transformations, leading to the concept of the entropy of
$p_X$ relative to another distribution $q_X$,
$S = -\int dx\, p_X(x) \ln \left[ p_X(x) / q_X(x) \right]$,
which is, up to its sign, the Kullback-Leibler divergence. The quantity
defined in the text is more precisely the entropy of $p_X(x)$ relative to a
uniform probability density function. For a recent account of this,
close in spirit to this work, see Caticha (2008).



Probe combination in large galaxy surveys 5 



First introduced by Shannon (Shannon 1948) as a measure
of the uncertainty in a distribution on the actual outcome,
Shannon's entropy is now the cornerstone of information
theory. Jaynes' Maximal Entropy Principle states that the
$p_X$ for which this measure $S$ is maximal is the one that best
deals with our insufficient knowledge of the distribution,
and should therefore be preferred. We refer the reader to
Jaynes' work (Jaynes 1983; Jaynes & Bretthorst 2003) and
to Caticha (2008) for detailed discussions of the role of
entropy in probability theory and for the conceptual basis
of maximal entropy methods. Astronomical applications
related to some extent to Jaynes' ideas include image
reconstruction from noisy data (see e.g. Skilling & Bryan
1984; Starck & Pantin 1996; Maisinger et al. 2004 and
references therein), mass profile reconstruction from
shear estimates (Bridle et al. 1998; Marshall et al. 2002), as
well as model comparison when very few data are available
(Zunckel & Trotta 2007). We will see that for our purposes
as well it provides us with a powerful tool, and that the Maximal
Entropy Principle is the ideal complement to Fisher
information, fitting very well within our discussions in section 2
on the Cramér-Rao inequality.

Intuitively, the entropy $S$ of $p_X$ tells us how sharply
constrained the possible outcomes $x$ are, and Jaynes'
Maximal Entropy Principle selects the $p_X$ which is as
wide as possible, but at the same time consistent with the
constraints (21) that we put on it. The actual maximal
value attained by the entropy $S$, among all the possible
distributions which satisfy (21), is a function of the constraints
$\mathbf{O}$, which we denote by

$$ S(O_1, \cdots, O_n). \tag{23} $$



Of course it is a function of the model parameters $\theta$ as well,
since they enter the constraints. As we will see, the shape
of that surface as a function of $\mathbf{O}$, and thus implicitly as a
function of $\theta$, is the key point in understanding the Fisher
information content of the data. In the following, in order
to keep the notation simple, we will omit the dependency
on $\theta$ of most of our expressions, though it will always be
implicit.

The problem of finding the distribution $p_X$ that maximises
the entropy (22), while satisfying the set of constraints (21),
is an optimization exercise. We can quote the end result
(Jaynes 1983, chap. 11; Caticha 2008, chap. 4):
the probability density function $p_X$, when it exists, has the
following exponential form,

$$ p_X(x) = \frac{1}{Z} \exp\left[ -\sum_{i=1}^n \lambda_i o_i(x) \right], \tag{24} $$

in which to each constraint $O_i$ is associated a conjugate
quantity $\lambda_i$, that arises formally as a Lagrange multiplier in
this optimization problem with constraints. The conjugate
variables $\lambda$ are also called 'potentials', terminology that
we will adopt in the following. We will see below in
equation (28) that the potentials have a clear interpretation, in
the sense that each potential $\lambda_i$ quantifies how sensitive
the entropy function $S$ in (23) is to its associated constraint
$O_i$. The quantity $Z$, that plays the role of the normalisation
factor, is called the partition function. Since equation (24)
must integrate to unity, the explicit form of the partition
function is

$$ Z(\lambda_1, \cdots, \lambda_n) = \int dx\, \exp\left[ -\sum_{i=1}^n \lambda_i o_i(x) \right]. \tag{25} $$

The actual values of the potentials are set by the constraints
(21). In terms of the partition function, these reduce
to a system of equations to solve for the potentials,

$$ O_i = -\frac{\partial}{\partial \lambda_i} \ln Z, \qquad i = 1, \cdots, n. \tag{26} $$

The partition function $Z$ is closely related to the entropy $S$
of $p_X$. It is simple to show that the following relation holds,

$$ S = \ln Z + \sum_{i=1}^n \lambda_i O_i, \tag{27} $$

and the values of the potentials can be explicitly written
as a function of the entropy, in a relation mirroring equation
(26),

$$ \lambda_i = \frac{\partial S}{\partial O_i}, \qquad i = 1, \cdots, n. \tag{28} $$
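
For reference, both relations follow in a few lines from the exponential form (24). The steps below are a sketch using only the normalisation of $p_X$, the constraints (21) and equation (26).

```latex
\begin{align}
S &= -\int dx\, p_X(x) \ln p_X(x)
   = \int dx\, p_X(x) \left[ \ln Z + \sum_{i=1}^n \lambda_i o_i(x) \right]
   = \ln Z + \sum_{i=1}^n \lambda_i O_i, \\
\frac{\partial S}{\partial O_j}
  &= \sum_{i=1}^n \frac{\partial \ln Z}{\partial \lambda_i} \frac{\partial \lambda_i}{\partial O_j}
   + \sum_{i=1}^n \frac{\partial \lambda_i}{\partial O_j} O_i + \lambda_j
   = \sum_{i=1}^n \left( -O_i + O_i \right) \frac{\partial \lambda_i}{\partial O_j} + \lambda_j
   = \lambda_j .
\end{align}
```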



Given the nomenclature, it is of no surprise that a deep anal- 
ogy between this formalism and statistical physics does ex- 
ist. Just as the entropy, or partition function, of a physical 
system determines the physics of the system, the statisti- 
cal properties of these maximal entropy distributions follow 
from the functional form of the Shannon entropy or its par- 
tition function as a function of the constraints. For instance, 
the covariance matrix of the constraints is given by 

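$$ \left\langle \left( o_i(x) - O_i \right) \left( o_j(x) - O_j \right) \right\rangle = \frac{\partial^2 \ln Z}{\partial \lambda_i \partial \lambda_j}. \tag{29} $$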

In statistical physics the constraints can be the mean
energy, the volume or the mean particle number, with
potentials being the temperature, the pressure and the chemical
potential. We refer to Jaynes (1957) for the connection to
the physical concept of entropy in thermodynamics and
statistical physics.
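
The relations (24) to (29) can be verified on a small discrete example. The sketch below is illustrative only: it replaces the integral over $x$ by a sum over the six faces of a die, imposes a single constraint on the mean face value (the target 4.5 is an arbitrary choice), and solves equation (26) for the potential by bisection. All function names are assumptions made for this example.

```python
import math

FACES = [1, 2, 3, 4, 5, 6]

def log_z(lam):
    """ln Z(lambda) for p_k = exp(-lambda * k) / Z: equation (25) with a sum over faces."""
    return math.log(sum(math.exp(-lam * k) for k in FACES))

def probabilities(lam):
    """Maximal entropy distribution of equation (24) for a single constraint."""
    z = math.exp(log_z(lam))
    return [math.exp(-lam * k) / z for k in FACES]

def mean_face(lam):
    return sum(p * k for p, k in zip(probabilities(lam), FACES))

def solve_potential(target_mean, lo=-20.0, hi=20.0):
    """Bisection on equation (26); the mean decreases monotonically with lambda."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_face(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_entropy(target_mean):
    """Maximal entropy S(O) as a function of the constraint, as in equation (23)."""
    p = probabilities(solve_potential(target_mean))
    return -sum(pk * math.log(pk) for pk in p)

O = 4.5                                  # assumed value of the constraint
lam = solve_potential(O)
p = probabilities(lam)
entropy = -sum(pk * math.log(pk) for pk in p)
variance = sum(pk * (k - O) ** 2 for pk, k in zip(p, FACES))

h = 1e-3                                 # finite-difference step
d2_log_z = (log_z(lam + h) - 2.0 * log_z(lam) + log_z(lam - h)) / h ** 2
dS_dO = (max_entropy(O + h) - max_entropy(O - h)) / (2.0 * h)

print(lam, entropy, log_z(lam) + lam * O)   # equation (27)
print(variance, d2_log_z)                   # equation (29)
print(lam, dS_dO)                           # equation (28)
```

The potential comes out negative here because the imposed mean lies above the value 3.5 of the uniform distribution, so the larger faces must be favoured.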



2.5 The structure of the information in large data sets



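With our choice of probabilities $p_X$ given by equation (24),
the amount of Fisher information on the parameters
$\theta = (\alpha, \beta, \cdots)$ of the model can be evaluated in a
straightforward way. The dependence on the model goes through the
constraints, or, equivalently, through their associated
potentials. It holds therefore that

$$ \begin{aligned} \frac{\partial \ln p_X(x)}{\partial \alpha} &= -\frac{\partial \ln Z}{\partial \alpha} - \sum_{i=1}^n \frac{\partial \lambda_i}{\partial \alpha} o_i(x) \\ &= \sum_{i=1}^n \frac{\partial \lambda_i}{\partial \alpha} \left[ O_i - o_i(x) \right], \end{aligned} \tag{30} $$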

where the second line follows from the first after application
of the chain rule and equation (26). Using the covariance
matrix of the constraints given in (29), the Fisher information
matrix, defined in (15), can then be written as a double
sum over the potentials,

$$ F_{\alpha\beta} = \sum_{i,j=1}^n \frac{\partial \lambda_i}{\partial \alpha} \frac{\partial^2 \ln Z}{\partial \lambda_i \partial \lambda_j} \frac{\partial \lambda_j}{\partial \beta}. \tag{31} $$

There are several ways to rewrite this expression as a function
of the constraints and/or their potentials. First, it can
be written as a single sum by using equation (26) as

$$ F_{\alpha\beta} = -\sum_{i=1}^n \frac{\partial \lambda_i}{\partial \alpha} \frac{\partial O_i}{\partial \beta}. \tag{32} $$



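Alternatively, since we will be more interested in using the
constraints as the main variables, and not the potentials, we
can show, using equation (28), that it also takes the form^3

$$ F_{\alpha\beta} = -\sum_{i,j=1}^n \frac{\partial O_i}{\partial \alpha} \frac{\partial^2 S}{\partial O_i \partial O_j} \frac{\partial O_j}{\partial \beta}. \tag{33} $$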



We will use both of these last expressions in the following
parts of this work.

Equation (33) presents the total amount of information on
the model parameters $\theta$ in the data $X$, when the model
predicts the set of constraints $O_i$. The amount of information
is in the form of a sum of the information contained
in each constraint, with correlations taken into account, as
in the right hand side of equation (20). In particular, it is
a property of the maximal entropy distributions that, if the
constraints $O_i$ are not redundant, the
curvature matrix of the entropy surface $-\partial^2 S$ is invertible
and is the inverse of the covariance matrix $\partial^2 \ln Z$ between
the observables. To see this explicitly, consider the derivative
of equation (26) with respect to the potentials,

$$ \frac{\partial O_i}{\partial \lambda_j} = -\frac{\partial^2 \ln Z}{\partial \lambda_i \partial \lambda_j}. \tag{34} $$

The inverse of the matrix on the left hand side, if it can be
inverted, is $\partial \lambda_i / \partial O_j$, which can be obtained by taking the
derivative of equation (28), with the result

$$ \frac{\partial \lambda_i}{\partial O_j} = \frac{\partial^2 S}{\partial O_i \partial O_j}. \tag{35} $$

We have thus obtained in equation (33), combining Jaynes'
Maximal Entropy Principle together with Fisher's information,
the exact expression of the Cramér-Rao inequality
(20) for our full set of constraints, but with an equality
sign.
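
Equations (32) and (33) can be checked on the simplest continuous case. The sketch below rests on assumptions that are not taken from the paper: the two constraints are the first two moments of a single variable, $O_1 = \langle x \rangle$ and $O_2 = \langle x^2 \rangle$, for which the maximal entropy distribution is a Gaussian, and the model parameters are its mean and width, $\theta = (\mu, \sigma)$. Derivatives are taken by finite differences and the helper names are illustrative.

```python
import math

def potentials(theta):
    """Potentials of p(x) proportional to exp(-lam1 * x - lam2 * x^2) for a Gaussian."""
    mu, sigma = theta
    return [-mu / sigma ** 2, 0.5 / sigma ** 2]

def constraints(theta):
    """O_1 = <x> and O_2 = <x^2> as functions of theta = (mu, sigma)."""
    mu, sigma = theta
    return [mu, mu ** 2 + sigma ** 2]

def entropy(o):
    """Maximal entropy as a function of the constraints: 0.5 * ln(2 pi e (O_2 - O_1^2))."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (o[1] - o[0] ** 2))

def jacobian(func, theta, h=1e-5):
    """J[i][a] = d func_i / d theta_a, by central differences."""
    cols = []
    for a in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[a] += h
        dn[a] -= h
        cols.append([(u - d) / (2.0 * h) for u, d in zip(func(up), func(dn))])
    return [[cols[a][i] for a in range(len(theta))] for i in range(len(cols[0]))]

def hessian(func, point, h=1e-4):
    """H[i][j] = d^2 func / d x_i d x_j, by central differences."""
    n = len(point)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                q = list(point)
                q[i] += si * h
                q[j] += sj * h
                return func(q)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return hess

THETA = (0.5, 1.5)                        # assumed fiducial (mu, sigma)
dlam = jacobian(potentials, THETA)        # dlam[i][a] = d lambda_i / d theta_a
dO = jacobian(constraints, THETA)         # dO[i][a] = d O_i / d theta_a
d2S = hessian(entropy, constraints(THETA))

# Equation (32): F_ab = - sum_i (d lambda_i / d a) (d O_i / d b)
fisher_32 = [[-sum(dlam[i][a] * dO[i][b] for i in range(2)) for b in range(2)]
             for a in range(2)]
# Equation (33): F_ab = - sum_ij (d O_i / d a) (d^2 S / d O_i d O_j) (d O_j / d b)
fisher_33 = [[-sum(dO[i][a] * d2S[i][j] * dO[j][b] for i in range(2) for j in range(2))
              for b in range(2)] for a in range(2)]
# Standard Fisher matrix of a Gaussian in (mu, sigma), for comparison.
fisher_ref = [[1.0 / THETA[1] ** 2, 0.0], [0.0, 2.0 / THETA[1] ** 2]]

print(fisher_32, fisher_33, fisher_ref)
```

Both expressions reproduce the familiar Gaussian result, here obtained from the potentials and from the curvature of the entropy surface rather than from the density itself.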

We see that the choice of maximal entropy probabilities
is fair, in the sense that all the Fisher information
comes from what was forced upon the probability density
function, i.e. the constraints. No additional Fisher
information is added when these probabilities are chosen. In
fact, this requirement alone is enough to single out the
maximal entropy distributions, as being precisely those for
which the Cramér-Rao inequality is an equality. This can
be understood in terms of sufficient statistics and goes back
to Pitman & Wishart (1936) and Koopman (1936). This
was shown in Zografos & Ferentinos (1994). We provide in
the appendices, for completeness, a similar argument that if the
equality sign holds in equation (20) for some distribution,
then it is the one that maximises the entropy relative to
some other distribution.

^3 We note that this result is valid only for maximal entropy
distributions and is not equivalent to the second derivative of the
entropy with respect to the parameters themselves. However, it
is formally identical to the corresponding expression for the
information content of distributions within the exponential family
(Jennrich & Moore 1975, or van den Bos 2007, chapter 4), once
the curvature of the entropy surface is identified with the
generalized inverse of the covariance matrix.

In the special case that the model parameters are the
constraints themselves, we have

$$ F_{O_i O_j} = -\frac{\partial^2 S}{\partial O_i \partial O_j} = -\frac{\partial \lambda_i}{\partial O_j}, \tag{36} $$



which means that the Fisher information on the model
predictions contained in the expected future data is directly
given by the sensitivity of their corresponding potentials.
Also, the application of the Cramér-Rao inequality, in
the form given in equation (19), to any set of unbiased
estimators of $\mathbf{O}$, shows that the covariance of the best joint,
unbiased reconstruction of $\mathbf{O}$ is given by the inverse curvature of the
entropy surface $-\partial^2 S$, which is, as we have shown, $\partial^2 \ln Z$.

We emphasise at this point that although the amount
of information is seen to be identical to the Fisher information
in a Gaussian distribution of the observables with the
above correlations, nowhere in our approach do we assume
Gaussian properties. The distribution of the constraints
$o_i(x)$ themselves is set by the maximal entropy distribution
of the data.



2.6 Redundant observables 

We have just seen that in the case of independent
constraints, the entropy of $p_X$ provides through equation (33)
both the joint information content of the data, as well as
the inverse correlation matrix between the observables.
However, if the constraints put on the distribution are
redundant, the correlation matrix is not invertible, and
the curvature of the entropy surface cannot be inverted
either. We show however that in these cases, our equations
for the Fisher information content, (31), (32) and (33), are still
fully consistent, dealing automatically with redundant
information to provide the correct answer.

An example of redundant information occurs trivially if
one of the functions $o_i(x)$ can be written in terms of the
others. For instance, for galaxy survey data, one may specify
as constraints the galaxy power spectrum together with
the mean number of galaxy pairs as a function of distance
and/or the two-point correlation function, which are three
equivalent descriptions of the same statistical property of
the data. Although the number of observables $\mathbf{O}$, and thus
the number of potentials, describing the maximal entropy
distribution greatly increases by doing so, it is clear that we
should expect the Fisher matrix to be unchanged by adding
such superfluous pieces of information. A small calculation
shows that the potentials adjust themselves so that this is
actually the case, meaning that this type of redundant
information is automatically discarded within this approach.
Therefore, we need not worry about the independence of
the constraints when evaluating the information content of
the data, which will prove convenient in some cases.
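
The way the potentials absorb a redundant constraint can be sketched on the same kind of toy case. Everything specific here is an assumption for illustration: a Gaussian maximal entropy distribution with constraints $\langle x \rangle$ and $\langle x^2 \rangle$, to which the redundant constraint $\langle 2x \rangle$ is added. Only the combination $\lambda_1 + 2\lambda_3$ is then fixed by the distribution, so the split between the two potentials (the hypothetical function `t` below) is arbitrary.

```python
def constraints(theta, redundant):
    mu, sigma = theta
    o = [mu, mu ** 2 + sigma ** 2]       # <x>, <x^2>
    if redundant:
        o.append(2.0 * mu)               # <2x>: a rescaled copy of the first constraint
    return o

def potentials(theta, redundant):
    mu, sigma = theta
    total_linear = -mu / sigma ** 2      # coefficient of x in -ln p_X, fixed by the Gaussian
    lam2 = 0.5 / sigma ** 2
    if not redundant:
        return [total_linear, lam2]
    t = 0.3 + 0.1 * mu * sigma           # arbitrary, hypothetical split between lam1 and lam3
    return [t * total_linear, lam2, 0.5 * (1.0 - t) * total_linear]

def jacobian(func, theta, h=1e-5):
    """J[i][a] = d func_i / d theta_a, by central differences."""
    cols = []
    for a in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[a] += h
        dn[a] -= h
        cols.append([(u - d) / (2.0 * h) for u, d in zip(func(up), func(dn))])
    return [[cols[a][i] for a in range(len(theta))] for i in range(len(cols[0]))]

def fisher_eq32(theta, redundant):
    """Equation (32), summed over however many constraints are imposed."""
    dlam = jacobian(lambda th: potentials(th, redundant), theta)
    dO = jacobian(lambda th: constraints(th, redundant), theta)
    n = len(dlam)
    return [[-sum(dlam[i][a] * dO[i][b] for i in range(n)) for b in range(2)]
            for a in range(2)]

THETA = (0.5, 1.5)                       # assumed fiducial (mu, sigma)
fisher_two = fisher_eq32(THETA, redundant=False)
fisher_three = fisher_eq32(THETA, redundant=True)

print(fisher_two)
print(fisher_three)
```

In this sketch the two matrices agree whatever the split, because the extra terms in the sum only ever enter through the combination $\lambda_1 + 2\lambda_3$.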
There is another, more relevant type of redundant
information, that allows us to understand better the role of the
potentials. Consider that we have some set of constraints
$\{O_i\}_{i=1}^n$, and that we obtain the corresponding $p_X$ that
maximises the entropy. This $p_X$ could then be used to predict
the value $O_{n+1}$ of the average of some other function $o_{n+1}(x)$,
that is not contained in our set of predictions,

$$ \langle o_{n+1}(x) \rangle =: O_{n+1}. \tag{37} $$



For instance, the maximal entropy distribution built with
constraints on the first $n$ moments of $p_X$ will predict some
particular value for the $(n+1)$-th moment, $O_{n+1}$, that the
model was unable to predict by itself.

Suppose now some new theoretical work provides the shape of
$O_{n+1}$ as a function of the model parameters. This new
constraint can thus now be added to the previous set, and a new,
updated $p_X$ is obtained by maximising the entropy. There are
two possibilities at this point:

It may occur that the value of $O_{n+1}$ as provided by the model
is identical to the prediction by the maximal entropy distribution
that was built without that constraint. Since the new constraint
was automatically satisfied, the maximal entropy distribution
satisfying the full set of $n+1$ constraints must be equal to the
one satisfying the original set. From the equality of the two
distributions, which are both of the form (24), it follows that
the additional constraint must have vanishing associated potential,

$$\lambda_{n+1} = 0, \tag{38}$$



while the other potentials are pairwise identical. It follows
immediately that the total information, as seen from equation (32),
is unaffected, and no information on the model parameters was
gained by this additional prediction. A cosmological example would
be to enforce on the distribution of some field, together with the
two-point correlation function, fully disconnected higher order
correlation functions. It is well known that the maximal entropy
distribution with constrained two-point correlation function has a
Gaussian shape, and that Gaussian distributions have disconnected
correlation functions at any order. No information is thus
provided by these field moments of higher order in this case.
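The vanishing potential of equation (38) can be checked numerically on a small toy problem. The sketch below is only an illustration under stated assumptions: the variable lives on a finite grid, the maximal entropy distribution is taken of the form $p(x) \propto \exp(-\sum_i \lambda_i o_i(x))$, and the potentials are found with a damped Newton iteration on the convex dual function. The helper names (`probabilities`, `dual`, `solve_maxent`) and the grid are hypothetical choices made for this example, not taken from the text.

```python
import numpy as np

def probabilities(o, lam):
    """Maximal entropy distribution p(x) proportional to exp(-lam . o(x)) on a finite grid."""
    a = -o @ lam
    w = np.exp(a - a.max())
    return w / w.sum()

def dual(o, lam, targets):
    """Convex dual ln Z(lam) + lam . O; its minimum enforces <o_i> = O_i."""
    a = -o @ lam
    return a.max() + np.log(np.exp(a - a.max()).sum()) + lam @ targets

def solve_maxent(o, targets, n_iter=200, tol=1e-12):
    """Damped Newton iteration for the potentials (toy solver, assumed for illustration)."""
    lam = np.zeros(o.shape[1])
    for _ in range(n_iter):
        p = probabilities(o, lam)
        mean = p @ o
        grad = targets - mean                      # gradient of the dual
        if np.linalg.norm(grad) < tol:
            break
        cov = (o * p[:, None]).T @ o - np.outer(mean, mean)  # Hessian of the dual
        step = -np.linalg.solve(cov, grad)
        t = 1.0
        # backtracking line search keeps the iteration stable
        while t > 1e-12 and dual(o, lam + t * step, targets) > (
            dual(o, lam, targets) + 1e-4 * t * (grad @ step)
        ):
            t *= 0.5
        lam = lam + t * step
    return lam, probabilities(o, lam)

x = np.linspace(0.0, 1.0, 11)                      # toy support of the variable

# Step 1: constrain the mean only, then predict the second moment.
lam1, p1 = solve_maxent(x[:, None], np.array([0.3]))
O2_predicted = p1 @ x**2

# Step 2: add the second moment as a constraint, set to its predicted value.
lam2, p2 = solve_maxent(np.stack([x, x**2], axis=1), np.array([0.3, O2_predicted]))
```

In this toy setting the second solve returns the same distribution as the first, with the potential of the added constraint numerically zero, which is the behaviour described around equation (38).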

This argument shows that, for a given set of original
constraints and associated maximal entropy distribution,
any function $f(x)$ which was not contained in this set,
with average $F$, can be seen as being set to zero potential.
Such $F$'s therefore do not contribute to the information.



Of course, although the formulae of this section are 
valid for any model, it requires numerical work in order to 
get the partition function and/or the entropy surface in a 
general situation. 



2.7 The entropy and Fisher information content 
of Gaussian homogeneous fields 

In order to close this section, we now obtain the Shannon
entropy of a family of fields when only the two-point correlation
function is the relevant constraint, which we will use
extensively in the next section dealing with our cosmological
application. It is easily obtained by a straightforward
generalisation of the finite dimensional multivariate case, where
the means and covariance matrix of the variables are known.
It is well known (Shannon 1948) that the maximal entropy
distribution is in this case the multivariate Gaussian
distribution. Denoting the constraints on $p_X$ with the matrix $D$
and vector $\mu$,

$$D_{ij} = \langle x_i x_j \rangle, \qquad \mu_i = \langle x_i \rangle, \qquad i,j = 1,\dots,N, \tag{40}$$

the associated potentials are given explicitly by the relations

$$\lambda = \frac{1}{2}\, C^{-1}, \qquad \eta = -C^{-1}\mu, \tag{41}$$

where the matrix $C$ is the covariance matrix

$$C := D - \mu\mu^T. \tag{42}$$



The Shannon entropy is given by, up to some irrelevant
additive constant,

$$S(D,\mu) = \frac{1}{2} \ln\det\left(D - \mu\mu^T\right). \tag{43}$$



The fact that about half of the constraints are redundant,
due to the symmetry of the $D$ and $C$ matrices, is reflected
by the fact that the corresponding inverse correlation matrix
in equation (33),

$$-\frac{\partial^2 S}{\partial D_{ij}\,\partial D_{kl}} = \frac{1}{2}\, C^{-1}_{jk}\, C^{-1}_{li}, \tag{44}$$

is not invertible as such if we consider all entries of the
matrix $D$ as constraints. Of course, this is not the case anymore
if only the independent entries of $D$ form the constraints.
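The relation between the entropy (43) and the potentials can be verified by finite differences. The sketch below assumes the convention $\lambda = \partial S/\partial D$ and $\eta = \partial S/\partial \mu$, with all entries of $D$ varied independently; the random test matrices are arbitrary and serve only as an illustration.

```python
import numpy as np

def entropy(D, mu):
    """S(D, mu) = 1/2 ln det(D - mu mu^T), up to an additive constant."""
    C = D - np.outer(mu, mu)
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet

rng = np.random.default_rng(0)
N = 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)        # arbitrary positive definite covariance
mu = rng.normal(size=N)
D = C + np.outer(mu, mu)           # second-moment matrix consistent with C and mu

eps = 1e-6
lam_fd = np.zeros((N, N))          # finite-difference dS/dD_ij
for i in range(N):
    for j in range(N):
        dD = np.zeros((N, N))
        dD[i, j] = eps
        lam_fd[i, j] = (entropy(D + dD, mu) - entropy(D - dD, mu)) / (2 * eps)

eta_fd = np.zeros(N)               # finite-difference dS/dmu_i
for i in range(N):
    dmu = np.zeros(N)
    dmu[i] = eps
    eta_fd[i] = (entropy(D, mu + dmu) - entropy(D, mu - dmu)) / (2 * eps)

lam_exact = 0.5 * np.linalg.inv(C)
eta_exact = -np.linalg.solve(C, mu)
```

Under these assumptions the numerical derivatives agree with $\frac{1}{2}C^{-1}$ and $-C^{-1}\mu$ to the accuracy of the finite-difference step.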



More interesting is, of course, the case where this
additional constraint differs from the predictions obtained
from the original set $\{O_i\}_{i=1}^n$. Suppose that there is a
mismatch $\delta O_{n+1}$ between the predictions of the maximal
entropy distribution and the model. In this case, when
updating $p_X$ to include this constraint, the potentials are
changed by this new information, a change given to first
order by

$$\delta\lambda_i = \frac{\partial^2 S}{\partial O_i\,\partial O_{n+1}}\;\delta O_{n+1}, \qquad i = 1,\dots,n+1, \tag{39}$$

and the amount of Fisher information changes accordingly.



2.7.1 Fields, means and correlations 

Using the handy formalism of functional calculus, we can
straightforwardly extend the above relations to systems
with infinite degrees of freedom, i.e. fields, where means
as well as the two-point correlation functions are constrained.
A realisation of the variable $X$ is now a field,
or a family of fields $\phi = (\phi_1,\dots,\phi_N)$, taking values on
some $n$-dimensional space. The expressions above in the
multivariate case all stay valid, with the understanding
that operations such as matrix multiplications have to be
taken with respect to the discrete indices as well as the
continuous ones.

With the two-point correlation function and means

$$\rho_{ij}(\mathbf{x},\mathbf{y}) = \langle \phi_i(\mathbf{x})\,\phi_j(\mathbf{y}) \rangle, \qquad \bar\phi_i(\mathbf{x}) = \langle \phi_i(\mathbf{x}) \rangle, \tag{45}$$

we still have, up to an unimportant constant,

$$S = \frac{1}{2} \ln\det\left(\rho - \bar\phi\,\bar\phi^T\right). \tag{46}$$



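In $n$-dimensional Euclidean space, within a box of volume
$V$ for a family of homogeneous fields, it is simplest to work
with the spectral matrices. These are defined as

$$\frac{1}{V}\left\langle \tilde\phi_i(\mathbf{k})\,\tilde\phi_j^*(\mathbf{k}') \right\rangle = P_{ij}(\mathbf{k})\,\delta_{\mathbf{k}\mathbf{k}'}, \tag{47}$$

where the Fourier transforms of the fields are defined
through

$$\tilde\phi_i(\mathbf{k}) = \int_V d^n x\; \phi_i(\mathbf{x})\, e^{-i\mathbf{k}\cdot\mathbf{x}}. \tag{48}$$

It is well known that these matrices provide an equivalent
description of the correlations, since they form Fourier
pairs with the correlation functions,

$$\rho_{ij}(\mathbf{x},\mathbf{y}) = \frac{1}{V}\sum_{\mathbf{k}} P_{ij}(\mathbf{k})\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{y})} = \rho_{ij}(\mathbf{x}-\mathbf{y}). \tag{49}$$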



In this case, the entropy in equation (46) reduces, again
discarding irrelevant constants, to an uncorrelated sum over
the modes,

$$S = \frac{1}{2} \ln\det\frac{P_c(0)}{V} + \frac{1}{2}\sum_{\mathbf{k}\neq 0} \ln\det\frac{P(\mathbf{k})}{V}, \tag{50}$$



which is the straightforward multidimensional version of
(Taylor & Watts 2001, eq. 39). Comparison with equation
(43) shows the well-known fact that the modes can be seen
as Gaussian, uncorrelated and complex variables with
correlation matrices proportional to $P(\mathbf{k})$. All modes have zero
mean, except for the zero-mode, which, as seen from its
definition, is proportional to the mean of the field itself.
Accordingly, taking the appropriate derivatives, the potentials
$\lambda(\mathbf{k})$ associated to $P(\mathbf{k})$ read

$$\lambda(\mathbf{k}) = \frac{1}{2}\, P^{-1}(\mathbf{k}), \quad \mathbf{k} \neq 0, \qquad \lambda(0) = \frac{1}{2}\, P_c^{-1}(0), \tag{51}$$

and those associated to the means $\bar\phi$,

$$\eta = -\left[\frac{P_c(0)}{V}\right]^{-1} \bar\phi. \tag{52}$$



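Note that although the spectral matrices are, in general,
complex, they are hermitian, so that the determinants are
real. The amount of Fisher information in the family of fields
is easily obtained with the help of equation (32), with the
familiar result

$$F_{\alpha\beta} = \frac{1}{2}\sum_{\mathbf{k}} \mathrm{Tr}\left[ P_c^{-1}(\mathbf{k})\,\frac{\partial P_c(\mathbf{k})}{\partial\alpha}\, P_c^{-1}(\mathbf{k})\,\frac{\partial P_c(\mathbf{k})}{\partial\beta} \right] + \frac{\partial\bar\phi^\dagger}{\partial\alpha}\left[\frac{P_c(0)}{V}\right]^{-1}\frac{\partial\bar\phi}{\partial\beta}, \tag{53}$$

with $P_c(\mathbf{k})$ being the connected part of the spectral matrices,

$$P_c(\mathbf{k}) = P(\mathbf{k}) - \delta_{\mathbf{k}0}\, V\, \bar\phi\,\bar\phi^\dagger. \tag{54}$$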



These expressions are of course also valid for isotropic fields 
on the sphere. With a decomposition in spherical harmonics, 
the sum runs over the multipoles. 



3 COSMOLOGICAL APPLICATION TO WEAK 
LENSING OBSERVABLES 

Gravitational lensing, which can be used to measure the
distribution of mass along the line of sight, has been
recognized as a powerful probe of the dark components of the
Universe (Schneider et al. 1992; Bartelmann & Schneider 2001;
Refregier 2003; Munshi et al. 2006; Schneider et al. 2006),
since it is sensitive to both the geometry of the Universe and
to the growth of structure. Weak lensing data is typically
used in two ways. The first, which is deployed for cosmological
parameter fitting, relies on measuring the correlated
distortions in galaxy images (Albrecht et al. 2006). The second
approach uses each galaxy to make a noisy measurement of the
lensing signal at that position. These point estimates are then
used to reconstruct the dark matter density distribution
(e.g. Kaiser & Squires 1993; Seitz & Schneider 2001). Most of
the measurements of weak lensing to date have focused on the
shearing that galaxy images experience. However, gravitational
lensing causes a number of other distortions of galaxy images.
These include a change in size, which is related to the
magnification, and higher order image distortions known as the
flexion (Bacon et al. 2006). A number of techniques have been
developed for measuring these higher order image distortions,
such as HOLICS (Okura et al. 2007) and shapelets methods
(Massey et al. 2003). Since all of the image distortions
originate from the same cause, i.e. the lensing potential field,
the information content of any two lensing measurements must be
degenerate. At the same time, since each method has different
systematics and specific noise properties, combining multiple
measurements may bring substantial benefits. Some recent works
have looked at the impact of combining shear and flexion
measurements for mass reconstruction (Er et al. 2010; Pires &
Amara 2010; Velander et al. 2010), as well as the benefits for
breaking multiplicative bias of including galaxy size
measurements (Vallinotto et al. 2010).



3.1 Linear probes 

The predictive power of some observable $O_c$ of a central field
(for instance its power spectrum at some mode) translates
into an array of constraints $O_i$, $i = 1,\dots,n$, in the noisy
probes, that we could try to extract and confront with theory,

$$O_i(\theta) = f_i\left(O_c(\theta)\right), \qquad i = 1,\dots,n, \tag{55}$$

for some functions $f_i$.

For the purpose of this work, the case of functions linear
with respect to $O_c$ is generic enough, i.e. we will consider
that

$$\frac{\partial^2 f_i}{\partial O_c^2} = 0, \qquad i = 1,\dots,n. \tag{56}$$

The entropy $S$ of the data is a function of the $n$ constraints
$O$. It is however fundamentally a function of $O_c$, since it does
enter all of these observables. It is therefore very natural to
associate a potential $\lambda_c$ to $O_c$, although it is not itself a
constraint on the probability density function. In analogy
with

$$\lambda_i = \frac{\partial S}{\partial O_i}, \qquad i = 1,\dots,n, \tag{57}$$

we define

$$\lambda_c := \frac{dS}{dO_c}\left(O_1,\dots,O_n\right), \tag{58}$$

with the result, given by application of the chain rule, of

$$\lambda_c = \lambda \cdot \frac{\partial f}{\partial O_c}. \tag{59}$$

On the other hand, the impact of a model parameter on each
observable can be similarly written in terms of the central
observable $O_c$,

$$\frac{\partial O}{\partial\alpha} = \frac{\partial O_c}{\partial\alpha}\,\frac{\partial f}{\partial O_c}. \tag{60}$$



It follows directly from these relations (59) and (60), and
the linearity of $f_i$, that the joint information in the full set
of constraints $O$, given in equation (32) as a sum over all $n$
constraints, reduces to a formally identical expression with
the only difference that only $O_c$ enters:

$$F^X_{\alpha\beta} = -\frac{\partial O}{\partial\alpha}\cdot\frac{\partial\lambda}{\partial\beta} = -\frac{\partial\lambda_c}{\partial\alpha}\,\frac{\partial O_c}{\partial\beta}, \tag{61}$$

which can also be written in the form analogous to (33),

$$F^X_{\alpha\beta} = -\frac{\partial O_c}{\partial\alpha}\,\frac{d^2 S}{dO_c^2}\,\frac{\partial O_c}{\partial\beta}. \tag{62}$$

This last equation shows that all the effects of combining this
set of constraints have been absorbed into the second total
derivative of the entropy. This second total derivative is the
total amount of information there is on the central quantity
$O_c$ in the data. Indeed, taking as a special case of model
parameter the central quantity itself, i.e.

$$\alpha = \beta = O_c, \tag{63}$$

one obtains that the full amount of information in $X$
on $O_c$ is

$$F^X_{O_c O_c} = -\frac{d^2 S}{dO_c^2}\left(O_1,\dots,O_n\right) =: \frac{1}{\sigma^2_{\rm eff}}. \tag{64}$$

A simple application of the Cramér-Rao inequality presented
in equation (11) shows that this effective variance
$\sigma^2_{\rm eff}$ is the lower bound to an unbiased reconstruction of the
central observable from the noisy probes.

These considerations on the effect of probe combination in
the case of a single central field observable $O_c$ generalize
easily to the case where there are many, $(O_c^1,\dots,O_c^m)$. In
this case, each central field quantity leads to an array of
constraints in the form of equation (55). It is simple to show
that the amount of Fisher information can again be written
in terms of the information associated to the central field,
with an effective covariance matrix between the $O_c$'s. The
result is

$$F^X_{\alpha\beta} = -\sum_{i,j=1}^{m} \frac{\partial O_c^i}{\partial\alpha}\,\frac{d^2 S}{dO_c^i\, dO_c^j}\,\frac{\partial O_c^j}{\partial\beta}. \tag{65}$$

All the effects of probe combination are thus encompassed
in an effective covariance matrix $\Sigma_{\rm eff}$ of the central field
observables,

$$\left[\Sigma_{\rm eff}^{-1}\right]_{ij} = -\frac{d^2 S}{dO_c^i\, dO_c^j}. \tag{66}$$

Again, an application of the Cramér-Rao inequality, in the
multi-dimensional case, shows that this effective covariance
matrix is the best achievable unbiased joint reconstruction
of $(O_c^1,\dots,O_c^m)$.

We now explore further the case of linear probes of
homogeneous Gaussian fields, which is cosmologically
relevant and can be solved analytically to full extent. We
will focus on zero mean fields, for which according to our
previous section the entropy can be written in terms of the
spectral matrices, up to a constant,

$$S = \frac{1}{2}\sum_{\mathbf{k}} \ln\det P(\mathbf{k}). \tag{67}$$



3.2 Linear tracers at the two-point level 

A standard instance of a linear tracer $\phi_i$ of some central field
$\kappa$ in weak lensing is provided by a relation in Fourier space
of the form

$$\tilde\phi_i(\mathbf{k}) = v_i\,\tilde\kappa(\mathbf{k}) + \tilde\epsilon_i(\mathbf{k}), \tag{68}$$

for some noise term $\tilde\epsilon_i$, uncorrelated with $\kappa$, and coefficient
$v_i$. Typically, if one observes a tracer of the derivative of the
field $\kappa$, then the vector $v$ would be proportional to $-i\mathbf{k}$. We
are ignoring here any observational effect, such as incomplete
sky coverage, that would require corrections to this relation.
It is clear from this relation that the spectral matrices of
this family take a special form of equation (55); denoting the
spectrum of the $\kappa$ field by $P^\kappa$, we obtain by putting this
relation (68) into (47) that the spectral matrices can be
written at each mode in the form

$$P = P^\kappa\, v v^\dagger + N, \tag{69}$$

where $v^\dagger$ is the hermitian conjugate of $v = (v_1,\dots,v_n)$.
The matrix $N$ is the spectrum of the noise components $\epsilon$,

$$N_{ij}(\mathbf{k}) = \frac{1}{V}\left\langle \tilde\epsilon_i(\mathbf{k})\,\tilde\epsilon_j^*(\mathbf{k}) \right\rangle. \tag{70}$$

Our subsequent results hold for any family of tracers that
obey this relation. While the special case in (68) enters this
category, this need not be the only instance. All the weak
lensing observables we deal with in this work will satisfy
equation (69).



Both the $n$-dimensional vector $v$ and the noise matrix
$N$ can depend on the wave vector $\mathbf{k}$, but they are
independent of the model parameters. The matrix $N$, of
dimension $n \times n$, is the noise component of the spectra of
the fields, typically built from two parts. The first is due to
the discrete nature of the fields, since such data consist of
quantities measured where galaxies sit, and the second to
the intrinsic dispersion of the measured values.

3.3 Joint entropy and information content

Information on the model parameters enters through $P^\kappa$
only. To evaluate the full information content, we need only
evaluate eq. (67) with the spectral matrix given in (69),
keeping in mind the result from the last section, that we need
only the total derivative with respect to $P^\kappa$. In other words,
any additive terms in the expression of the entropy that are
independent of $P^\kappa$ can be discarded.

This determinant can be evaluated immediately. Defining
for each mode the real positive number $N_{\rm eff}$ through

$$\frac{1}{N_{\rm eff}} := v^\dagger N^{-1} v, \tag{71}$$

which can be seen as an effective noise term, a simple^4
calculation shows that the joint entropy (67) is equivalent to
the following, where the $n$-dimensional determinant has
disappeared,

$$S = \frac{1}{2}\sum_{\mathbf{k}} \ln\left(P^\kappa(\mathbf{k}) + N_{\rm eff}(\mathbf{k})\right). \tag{73}$$



Comparison with equation (67) shows that we have with
this equation (73) the entropy of the field $\kappa$ itself, where
all the effects of the joint observation of these $n$ fields have
been absorbed into the effective noise term $N_{\rm eff}$ that
contaminates its spectrum. It means that the full combined
information in the $n$ probes of the field $\kappa$ is equivalent to
the information in $\kappa$, observed with spectral noise $N_{\rm eff}$.
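The reduction of the $n$-dimensional determinant to a single effective noise term can be checked numerically at one mode. The sketch below draws an arbitrary complex coefficient vector and an arbitrary hermitian positive definite noise matrix (both purely illustrative), and compares $\ln\det(P^\kappa v v^\dagger + N)$ with $\ln\det N + \ln(1 + P^\kappa/N_{\rm eff})$; the two differ from $\ln(P^\kappa + N_{\rm eff})$ only by terms independent of $P^\kappa$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                              # number of probes (toy value)
v = rng.normal(size=n) + 1j * rng.normal(size=n)   # arbitrary complex coefficients
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
N = B @ B.conj().T + n * np.eye(n)                 # hermitian, positive definite noise
P_kappa = 2.5                                      # spectrum of the central field at this mode

# Effective noise: 1 / N_eff = v^dagger N^{-1} v
N_eff = 1.0 / np.real(v.conj() @ np.linalg.solve(N, v))

# Full spectral matrix of the n probes at this mode
P = P_kappa * np.outer(v, v.conj()) + N

_, logdet_P = np.linalg.slogdet(P)
_, logdet_N = np.linalg.slogdet(N)

lhs = logdet_P
rhs = logdet_N + np.log(1.0 + P_kappa / N_eff)
```

Since `logdet_N` and `N_eff` do not depend on `P_kappa`, the derivative of `lhs` with respect to the central spectrum is the same as that of `np.log(P_kappa + N_eff)`, which is the content of equation (73).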



Our result (64) applied to (73) puts bounds on the
reconstruction of the field $\kappa$ out of the observed samples,
which can be at best reconstructed with a contaminating
noise term of $N_{\rm eff}$ in its spectrum, whose best unbiased
reconstruction has a variance given by

$$\sigma^2_{\rm eff} = 2\left(P^\kappa(\mathbf{k}) + N_{\rm eff}(\mathbf{k})\right)^2. \tag{74}$$



Since the effect of combining these probes at a single mode
is only to change the model independent noise term, the
parameter correlations and degeneracies as approximated
by the Fisher information matrix stay unchanged, whatever
the number of such probes is. We have namely from (73)
that at a given mode $\mathbf{k}$, the Fisher information matrix reads

$$F_{\alpha\beta} = \frac{1}{2}\,\frac{\partial \ln \tilde P^\kappa(\mathbf{k})}{\partial\alpha}\,\frac{\partial \ln \tilde P^\kappa(\mathbf{k})}{\partial\beta}, \tag{75}$$

with

$$\tilde P^\kappa(\mathbf{k}) = P^\kappa(\mathbf{k}) + N_{\rm eff}(\mathbf{k}). \tag{76}$$



From the point of view of the Fisher information, it makes
formally no difference to extract the full set of $n(n+1)/2$
independent elements of each spectral matrix, or to reconstruct
the field $\kappa$ and extract its spectrum. They carry
indeed the same amount of Fisher information.
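This equivalence can be illustrated numerically at a single mode. The sketch below compares the Gaussian Fisher information computed from the full spectral matrix, $\frac{1}{2}\mathrm{Tr}[P^{-1}\partial_\alpha P\, P^{-1}\partial_\beta P]$ with $\partial_\alpha P = (\partial_\alpha P^\kappa)\, v v^\dagger$, against the reduced form (75). The function names and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fisher_full(P_kappa, dP_a, dP_b, v, N):
    """1/2 Tr[P^-1 dP/da P^-1 dP/db] using the full n x n spectral matrix."""
    vv = np.outer(v, v.conj())
    P = P_kappa * vv + N
    Pinv = np.linalg.inv(P)
    return 0.5 * np.real(np.trace(Pinv @ (dP_a * vv) @ Pinv @ (dP_b * vv)))

def fisher_reduced(P_kappa, dP_a, dP_b, v, N):
    """Same information from the central field with effective noise N_eff."""
    N_eff = 1.0 / np.real(v.conj() @ np.linalg.solve(N, v))
    return 0.5 * dP_a * dP_b / (P_kappa + N_eff) ** 2

rng = np.random.default_rng(2)
n = 3
v = rng.normal(size=n) + 1j * rng.normal(size=n)
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
N = B @ B.conj().T + n * np.eye(n)      # arbitrary hermitian positive definite noise

# Toy values of the central spectrum and of its parameter derivatives
P_kappa, dP_alpha, dP_beta = 1.7, 0.4, -0.9

F_full = fisher_full(P_kappa, dP_alpha, dP_beta, v, N)
F_reduced = fisher_reduced(P_kappa, dP_alpha, dP_beta, v, N)
```

Both routes give the same number, so nothing is lost by working with the scalar spectrum $P^\kappa + N_{\rm eff}$ instead of the full matrix.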



^4 We have namely, for any invertible matrix $A$ and vectors $u$, $v$,
the matrix determinant lemma,

$$\det\left(A + u v^T\right) = \det(A)\left(1 + v^T A^{-1} u\right). \tag{72}$$



These results still hold when other fields are present
in the analysis, which are correlated with the field $\kappa$. To
make this statement rigorous, consider in the analysis,
on top of our $n$ samples of the form (68) of $\kappa$, another
homogeneous field $\theta$, with spectrum $P^\theta(\mathbf{k})$ and cross
spectrum to $\kappa$ given by $P^{\kappa\theta}(\mathbf{k})$. The full spectral matrices
are in this case

$$P(\mathbf{k}) = \begin{pmatrix} P^\kappa(\mathbf{k})\, v v^\dagger + N & P^{\kappa\theta}(\mathbf{k})\, v \\ P^{\theta\kappa}(\mathbf{k})\, v^\dagger & P^\theta(\mathbf{k}) \end{pmatrix}. \tag{77}$$

Again, the determinant of this matrix can be reduced to a
determinant of lower dimension, leading to the equivalent
entropy

$$S = \mathrm{cst} + \frac{1}{2}\sum_{\mathbf{k}} \ln\det \begin{pmatrix} P^\kappa(\mathbf{k}) + N_{\rm eff} & P^{\kappa\theta}(\mathbf{k}) \\ P^{\theta\kappa}(\mathbf{k}) & P^\theta(\mathbf{k}) \end{pmatrix}. \tag{78}$$

It shows that the full set of $n + 1$ fields can be reduced
without loss to two fields, $\kappa$ and $\theta$, with the effective noise
$N_{\rm eff}$ contaminating the spectrum of $\kappa$.

Note that the derivation of our results does not refer to
any hypothetical estimators, but comes naturally out of the
expression of the entropy.



3.4 Weak lensing probes 

We now seek a quantitative evaluation of the full joint
information content of the weak lensing probes in galaxy surveys,
up to second order in the image distortions of galaxies. The
data $X$ consists of a set of discrete point fields, which take
values where galaxies sit. We work in the two-dimensional flat
sky limit, using the more standard notation $\mathbf{l}$ for the wave
vector, and decompose it in modulus and polar angle as

$$\mathbf{l} = l \begin{pmatrix} \cos\varphi_l \\ \sin\varphi_l \end{pmatrix}. \tag{79}$$



For the scope of this paper, we will throughout assume that
the intrinsic values of each probe are pairwise uncorrelated,
as commonly done. Also, we will assume that the set of
points on which the relevant quantities are measured shows
low enough clustering so that corrections to the spectra due
to intrinsic clustering can be neglected. This is however not
a limitation of our approach, since corrections to the above
assumptions, such as the introduction of some level of
intrinsic alignment, can be accommodated by introducing
appropriate terms in the noise matrices $N(\mathbf{k})$ in (71). As a
central field to which all our point fields relate, we take for
convenience the isotropic convergence field $\kappa$, with spectrum

$$C^\kappa(\mathbf{l}) = C^\kappa(l). \tag{80}$$

In the case of pairwise uncorrelated intrinsic values that we
are following, we see easily from (71) that by combining any
number of such probes the effective noise is reduced at a
given mode according to

$$\frac{1}{N^{\rm tot}_{\rm eff}} = \sum_i \frac{1}{N^{i}_{\rm eff}}. \tag{81}$$



We therefore only need to evaluate the effective noise for
each probe separately, while their combination follows (81).
To this aim, the evaluation of the spectral matrices (69),
giving us $N_{\rm eff}$, is necessary. The calculations for this are
presented in appendix B and we use the final results in this
section.


3.4.1 First order, distortion matrix

To first order, the distortion induced by weak lensing on
a galaxy image is described by the distortion matrix that
contains the shear, $\gamma$, and convergence, $\kappa$, which come from
the second derivatives of the lensing potential field $\psi$ (e.g.
Schneider et al. 2006),

$$\begin{pmatrix} \kappa + \gamma_1 & \gamma_2 \\ \gamma_2 & \kappa - \gamma_1 \end{pmatrix} = \psi_{,ij}. \tag{82}$$

The shear components read

$$\gamma_1 = \frac{1}{2}\left(\psi_{,11} - \psi_{,22}\right), \qquad \gamma_2 = \psi_{,12}, \tag{83}$$



and we assume they are measured from the apparent
ellipticities of the galaxies, with identical intrinsic dispersion
$\sigma_\gamma$. Denoting with $n_\gamma$ the number density of galaxies for which
ellipticity measurements are available, the effective noise is

$$N^\gamma_{\rm eff} = \frac{\sigma_\gamma^2}{n_\gamma}. \tag{84}$$

The information content of the two observed ellipticity
fields is thus exactly the same as the one of the convergence
field, with a mode independent noise term as above.

To reach for the $\kappa$ component of the distortion matrix,
we imagine we have measurements of the angular size
$s_{\rm obs}$ of the galaxies, with intrinsic dispersion $\sigma_s$. The
intrinsic sizes of the galaxies $s_{\rm int}$ get transformed through
weak lensing according to

$$s_{\rm obs} = s_{\rm int}\left(1 + \alpha_s \kappa\right). \tag{85}$$

The coefficient $\alpha_s$ is equal to unity in pure weak lensing
theory, but we allow it to take other values, since in a realistic
situation other effects such as magnification bias effectively
enter this coefficient (e.g. Vallinotto et al. 2010). Under
our assumption that the correlation of the fluctuations in
intrinsic sizes can themselves be neglected, the effective noise
reduces to

$$N^s_{\rm eff} = \frac{1}{n_s}\left(\frac{\sigma_s}{\alpha_s\,\bar s_{\rm int}}\right)^2, \tag{86}$$

where $n_s$ is the number density of galaxies with size
measurements. This combination of $\alpha_s$ with the dispersion
parameters $\bar s_{\rm int}$ and $\sigma_s$ becomes the only relevant parameter
in our case, and not the value of each of them.



Table 1. Dispersion parameters used in figure 1.

  σ_γ     σ_F [arcsec^-1]    σ_G [arcsec^-1]    α_s^-1 σ_s / s̄_int
  0.25    0.04               0.04               0.9



and are extracted from measurements with intrinsic
dispersion $\sigma_F$ and $\sigma_G$. The effective noise is this time
mode-dependent,

$$\frac{1}{N^{FG}_{\rm eff}(l)} = l^2\left(\frac{n_F}{\sigma_F^2} + \frac{n_G}{\sigma_G^2}\right). \tag{88}$$
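The scale dependence implied by (88) can be sketched as follows, using $N^\gamma_{\rm eff} = \sigma_\gamma^2/n_\gamma$ for the shear and the inverse-sum rule (81) for the combination. The dispersion values are those of table 1; the assumptions made here for illustration are equal number densities for all probes, an arbitrary density value (the ratios do not depend on it), and the conversion of the flexion dispersions from arcsec$^{-1}$ to rad$^{-1}$ so that they can be combined with a dimensionless multipole. The function names are hypothetical.

```python
import numpy as np

ARCSEC = np.pi / (180.0 * 3600.0)      # radians per arcsecond

def n_eff_shear(n_gal, sigma_gamma=0.25):
    """Mode-independent effective noise of the two ellipticity fields."""
    return sigma_gamma**2 / n_gal

def n_eff_flexion(ell, n_gal, sigma_F=0.04, sigma_G=0.04):
    """Mode-dependent effective noise of the flexion fields, eq. (88).

    sigma_F and sigma_G are given in arcsec^-1 and converted to rad^-1;
    equal number densities n_F = n_G = n_gal are assumed.
    """
    sF = sigma_F / ARCSEC
    sG = sigma_G / ARCSEC
    return 1.0 / (ell**2 * n_gal * (1.0 / sF**2 + 1.0 / sG**2))

def combine(*noises):
    """Effective noise of uncorrelated probes, eq. (81): inverse noises add."""
    return 1.0 / sum(1.0 / N for N in noises)

ell = np.logspace(2, 6, 200)
n_gal = 1.0                            # arbitrary; cancels in the ratios below

ratio_flexion = n_eff_flexion(ell, n_gal) / n_eff_shear(n_gal)
ratio_joint = combine(n_eff_shear(n_gal), n_eff_flexion(ell, n_gal)) / n_eff_shear(n_gal)

# Multipole at which flexion alone matches shear alone (for sigma_F = sigma_G)
ell_cross = (0.04 / ARCSEC) / (np.sqrt(2.0) * 0.25)
theta_cross_arcsec = np.pi / ell_cross / ARCSEC
```

With these inputs the flexion-only noise falls as $l^{-2}$ relative to the shear noise, and the crossover sits at an angular scale of a few tens of arcseconds, while the joint noise is always below the shear-only level.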



4 RESULTS 

Figure 1 shows the ratio of the effective noise to the noise
present considering the shear fields only, assuming the same
number densities of galaxies for each probe, and the values
for the intrinsic dispersion stated in table 1. The conversion
from multipole $l$ (upper x-axis) to angular scale $\theta$ (lower
x-axis) follows $\theta = \pi/(l + 1/2)$. We have adopted for the size
dispersion parameters the numbers from Vallinotto et al. (2010),
who evaluated this number for the DES survey conditions
(The Dark Energy Survey Collaboration 2005). We refer to
the discussion in Pires & Amara (2010) for our choice of
flexion dispersion parameters. The curves on this figure are
ratios and therefore independent of the galaxy number density.
They are redshift independent as well, only to the extent
that the dispersion in intrinsic values can be treated as
such. We can draw two main conclusions from figure 1. First,
flexion information begins to play a role only at the smallest
scales, on the arcsecond scales, where it takes over and
becomes the most interesting probe. On the scale of 1 arcmin, it
can bring substantial improvement over shear only analysis,
but only in combination with the shears, and not on its own.
This is in good agreement with the comparative analysis of
the power of the flexion $F$ field and shear fields for mass
reconstruction done in Pires & Amara (2010), restricted to
direct inversion methods. Second, the inclusion of the size of
galaxies into the analysis provides a density independent,
scale independent improvement factor of

$$\frac{N^\gamma_{\rm eff}}{N^{\gamma+s}_{\rm eff}} = 1 + \left(\frac{\alpha_s\,\sigma_\gamma\,\bar s_{\rm int}}{\sigma_s}\right)^2, \tag{89}$$

which is close to a 10% improvement for the quoted numbers.
Of course, the precise value depends on the dispersion
parameters of the population considered.
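As a quick arithmetic check of this improvement factor, the sketch below evaluates it with the table 1 values, assuming equal number densities for the shear and size measurements and taking the factor to be $1 + (\sigma_\gamma / [\alpha_s^{-1}\sigma_s/\bar s_{\rm int}])^2$, i.e. the ratio of the shear and size effective noises added to unity.

```python
# Values from table 1
sigma_gamma = 0.25      # intrinsic ellipticity dispersion
size_dispersion = 0.9   # alpha_s^-1 * sigma_s / mean intrinsic size

# Ratio of shear-only effective noise to shear + size effective noise,
# assuming the same galaxy number density for both probes
improvement = 1.0 + (sigma_gamma / size_dispersion) ** 2
```

This evaluates to about 1.08, i.e. an improvement of just under ten percent.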



3.4.2 Second order, flexion

To second order, the distortions caused by lensing on the
galaxy images are given by third order derivatives of the
lensing potential. These are conveniently described by the
spin 1 and spin 3 flexion components $F$ and $G$, which in the
notation of Schneider & Er (2008) read

$$F = \frac{1}{2}\begin{pmatrix} \psi_{,111} + \psi_{,122} \\ \psi_{,112} + \psi_{,222} \end{pmatrix}, \qquad G = \frac{1}{2}\begin{pmatrix} \psi_{,111} - 3\psi_{,122} \\ 3\psi_{,112} - \psi_{,222} \end{pmatrix}, \tag{87}$$



For the purpose of measuring cosmological parameters
rather than mass reconstruction, more interesting are the
actual values of the Fisher information matrices. Since with
any combination of such probes these matrices are proportional
to each other at a single mode, it makes sense to define
the efficiency parameter of the probe $i$ through

$$\epsilon_i(l) := \frac{C^\kappa(l)}{C^\kappa(l) + N^i_{\rm eff}(l)}, \tag{90}$$

which is a measure of what fraction of the information
contained in the convergence field is effectively captured by that



Figure 1. The ratio of the effective noise to the level of noise con- 
sidering the shears only, as function of angular scale. The dashed 
line considers the flexion fields alone. The dotted line shows the 
combination of the flexion fields with the shear fields, and the 
solid line all these weak lensing probes combined. No correlations 
between the intrinsic values for each pair of probes have been 
considered. 



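probe. The information in the convergence field is, at a given
mode $l$, counting the multiplicity of the mode,

$$F^\kappa_{\alpha\beta}(l) = \frac{1}{2}(2l+1)\,\frac{\partial \ln C^\kappa(l)}{\partial\alpha}\,\frac{\partial \ln C^\kappa(l)}{\partial\beta}, \tag{91}$$

and we have indeed that the total Fisher information in the
observed fields is

$$F^X_{\alpha\beta}(l) = \epsilon^2(l)\, F^\kappa_{\alpha\beta}(l). \tag{92}$$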



Therefore, according to the interpretation of the Fisher
matrix as approximating the expected constraints on the
model parameters, the factor $\epsilon(l)$ is precisely equal to
the factor of degradation in the constraints one would
be able to put on any parameter, with respect to the
case of perfect knowledge of the convergence field at this
mode. It is not the purpose of this work to perform a very
detailed study of the behavior of the efficiency parameter
for some specific survey and the subsequent statistical
gain, but its qualitative behavior is easy to see. This
parameter is essentially unity in the high signal to noise
regime, while it scales as the inverse effective noise whenever
the intrinsic dispersion dominates the observed spectrum. Since
information on cosmological parameters is beaten down by
cosmic variance in the former case, the latter dominates
the constraints. We can therefore expect from our above
discussion the size information to tighten by a few percent
the constraints on any cosmological parameter. On the other
hand, while flexion becomes ideal for mass reconstruction
purposes on small scales, it will be able to help inference
on cosmological parameters only if the challenge of very
accurate theoretical predictions on the convergence power
spectrum for multipoles substantially larger than 1000 will
be met.
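The role of the efficiency parameter can be illustrated at a single multipole. The sketch below uses arbitrary toy values for the convergence spectrum, its parameter derivatives and the effective noise (none taken from the text), and checks that the Fisher information of the noisy spectrum $C^\kappa + N_{\rm eff}$ equals $\epsilon^2$ times that of the noise-free convergence field. The function name `efficiency` is a hypothetical helper.

```python
import numpy as np

def efficiency(C_kappa, N_eff):
    """Efficiency parameter: fraction of the signal in the observed spectrum."""
    return C_kappa / (C_kappa + N_eff)

# Toy values at one multipole (illustrative only)
ell = 500
C_kappa = 2.0e-9
dC_alpha, dC_beta = 3.0e-9, -1.0e-9    # derivatives of C_kappa w.r.t. two parameters
N_eff = 5.0e-9

# Information in the noise-free convergence field, counting the 2l+1 modes
F_kappa = 0.5 * (2 * ell + 1) * (dC_alpha / C_kappa) * (dC_beta / C_kappa)

# Information in the observed fields, whose spectrum carries the effective noise
C_obs = C_kappa + N_eff
F_obs = 0.5 * (2 * ell + 1) * (dC_alpha / C_obs) * (dC_beta / C_obs)

eps = efficiency(C_kappa, N_eff)
```

The two limits mentioned in the text follow directly: `efficiency` tends to one when the noise is negligible, and to `C_kappa / N_eff` when the noise dominates.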

To make these expectations more concrete, we evaluated
the improvement in information on cosmological
parameters by performing a lensing Fisher matrix calculation
for a wide, EUCLID-like survey, in a tomographic setting.

For a data vector consisting of $n$ probes of the convergence
field $\kappa_i$ in each redshift bin $i$, $i = 1,\dots,N$, it is simple
to see, following our previous argument, that the Fisher
information reduces to

$$F_{\alpha\beta} = \frac{1}{2}\sum_l (2l+1)\,\mathrm{Tr}\left[ C^{-1}\frac{\partial C}{\partial\alpha}\, C^{-1}\frac{\partial C}{\partial\beta} \right], \tag{93}$$

where the $C$ matrix is given by

$$C_{ij} = C^{\kappa_i\kappa_j}(l) + \delta_{ij}\, N^i_{\rm eff}(l), \qquad i,j = 1,\dots,N, \tag{94}$$

with $N^i_{\rm eff}$ given by (71). The only difference with respect to
standard implementations of Fisher matrices for lensing, such as
the lensing part of Hu & Jain (2004), being thus the form
of the noise component, we evaluated these matrices
respectively for

$$N^i_{\rm eff} = \frac{\sigma_\gamma^2}{n_i}, \tag{95}$$

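which is the precise form of the Fisher matrix for shear
analysis, for

$$\frac{1}{N^i_{\rm eff}} = \frac{1}{N^{\gamma,i}_{\rm eff}} + \frac{1}{N^{s,i}_{\rm eff}}, \tag{96}$$

which accounts for size information, and

$$\frac{1}{N^i_{\rm eff}(l)} = \frac{1}{N^{\gamma,i}_{\rm eff}} + \frac{1}{N^{s,i}_{\rm eff}} + \frac{1}{N^{FG,i}_{\rm eff}(l)}, \tag{97}$$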



which accounts for the flexion fields as well. We note that in
terms of observables, these small modifications incorporate
in its entirety the full set of all possible correlations
between the fields considered. The values of the dispersion
parameters involved in these formulae are the same as in
table 1. Our fiducial model is a flat $\Lambda$CDM universe,
with parameters $\Omega_\Lambda = 0.7$, $\Omega_b = 0.045$, $\Omega_m = 0.3$, $h = 0.7$,
power spectrum parameters $\sigma_8 = 0.8$, $n = 1$, and Linder's
parametrisation (Linder 2003) of the dark energy equation
of state implemented as $w_0 = -1$, $w_a = 0$. The distribution
of galaxies as function of redshift needed both for
the calculation of the spectra and to obtain the galaxy
densities in each bin was generated using the cosmological
package iCosmo (Refregier et al. 2008), in a way described
in (Amara & Refregier 2007). We adopted EUCLID-like
parameters of 10 redshift bins, a median redshift of 1,
a galaxy angular density of 40 arcmin$^{-2}$, and photometric
redshift errors of $0.03(1 + z)$.
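The structure of the tomographic Fisher calculation in (93) and (94) can be sketched as below. This is not the pipeline used for the results quoted here: the convergence spectra, their derivatives and the noise levels are random toy arrays built only to exercise the formula, and the function name `fisher_tomographic` and the array layout are assumptions of this example.

```python
import numpy as np

def fisher_tomographic(ells, C_signal, dC, N_eff):
    """Fisher matrix 1/2 sum_l (2l+1) Tr[C^-1 dC/da C^-1 dC/db].

    ells     : (L,) multipoles
    C_signal : (L, nb, nb) convergence cross-spectra between redshift bins
    dC       : (n_par, L, nb, nb) parameter derivatives of C_signal
    N_eff    : (L, nb) effective noise per multipole and bin (diagonal term)
    """
    n_par = dC.shape[0]
    F = np.zeros((n_par, n_par))
    for il, ell in enumerate(ells):
        C = C_signal[il] + np.diag(N_eff[il])
        # W[a] = C^-1 dC/da, so that Tr[W[a] W[b]] is the trace in (93)
        W = [np.linalg.solve(C, dC[a, il]) for a in range(n_par)]
        for a in range(n_par):
            for b in range(n_par):
                F[a, b] += 0.5 * (2 * ell + 1) * np.trace(W[a] @ W[b])
    return F

# Toy inputs, for illustration only
rng = np.random.default_rng(3)
n_bins, n_par = 3, 2
ells = np.arange(10, 60)
scale = (ells / 10.0) ** -1.0
G = rng.normal(size=(n_bins, n_bins))
C_signal = scale[:, None, None] * (G @ G.T + np.eye(n_bins))
dC = np.empty((n_par, len(ells), n_bins, n_bins))
for a in range(n_par):
    S = rng.normal(size=(n_bins, n_bins))
    dC[a] = scale[:, None, None] * (S + S.T)     # symmetric derivative matrices
N_eff = np.full((len(ells), n_bins), 0.5)

F = fisher_tomographic(ells, C_signal, dC, N_eff)
F_noisier = fisher_tomographic(ells, C_signal, dC, 2.0 * N_eff)
```

Switching between shear only, shear plus size, and all probes then amounts to passing a different `N_eff` array, in line with (95) to (97); a larger effective noise lowers the diagonal of the resulting Fisher matrix.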

In figure 2 we show the improvement in the dark energy
Figure of Merit (FOM), defined as the square root of
the determinant of the submatrix $(w_0, w_a)$ of the Fisher
matrix inverse $F^{-1}_{\alpha\beta}$ ($\alpha$ and $\beta$ running over the set of eight
parameters as described above), as function of the maximal
angular mode $l_{\max}$ considered, $l_{\min}$ being always
taken to be 10. In perfect agreement with our discussion
above, including size information (solid line) increases the
FOM steadily until it saturates at a 10% improvement when
constraints on the dark energy parameters are dominated
by the low signal to noise regime. Also, flexion becomes
useful only in the deep non-linear regime, where however
theoretical understanding of the shape of the spectra still
leaves a lot to be desired.
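The quantity entering this figure of merit can be computed from a Fisher matrix in a few lines. The sketch below follows the definition given in the text, the square root of the determinant of the $(w_0, w_a)$ block of the inverse Fisher matrix, which marginalises over all other parameters; a smaller value means tighter constraints, and the commonly used Dark Energy Task Force convention quotes a quantity proportional to its reciprocal. The function name, the parameter ordering and the toy matrix are assumptions of this example.

```python
import numpy as np

def w0wa_area(F, i_w0, i_wa):
    """sqrt(det) of the (w0, wa) block of F^-1, marginalised over other parameters."""
    cov = np.linalg.inv(F)
    idx = [i_w0, i_wa]
    sub = cov[np.ix_(idx, idx)]
    return np.sqrt(np.linalg.det(sub))

# Toy diagonal Fisher matrix with three parameters (illustrative only);
# the last two play the role of w0 and wa
F_toy = np.diag([4.0, 9.0, 16.0])
area = w0wa_area(F_toy, 1, 2)

# Doubling the information halves the area for a two-parameter block
area_doubled = w0wa_area(2.0 * F_toy, 1, 2)
```

An improvement such as the one shown in figure 2 would then be obtained as the ratio of this quantity for the shear-only Fisher matrix to that of the combined analysis.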

These results are found to be very insensitive to the



Figure 2. The improvement of the dark energy FOM including
size information (solid), as well as flexion $F$ and $G$ information
(dotted), over the shear only analysis, as function of the maximal
angular multipole included in the analysis.



Table 2. Ratio of the marginalised constraints $\sigma_\alpha/\sigma_\alpha^{\rm shear\ only}$,
for $l_{\max} = 10^4$. The first line considers the inclusion of the size
information in the analysis, while the second considers the size as
well as the flexion fields $F$ and $G$.

                      Ω_Λ    Ω_b    Ω_m    h      σ_8    n      w_0    w_a
  size                0.90   0.96   0.90   0.95   0.95   0.90   0.90   0.90
  size and flexion    0.88   0.96   0.89   0.95   0.93   0.88   0.88   0.88


survey parameters, for a fixed $\alpha_s$. They are also only
weakly dependent on the model parameter, as illustrated in
table 2, which shows the corresponding improvement in
Fisher constraints,

$$\frac{\sigma_\alpha}{\sigma_\alpha^{\rm shear\ only}}, \tag{98}$$

at the saturation scale $l_{\max} \approx 10^4$. These results are also
essentially unchanged using either standard implementations
of the halo model (Cooray & Sheth 2002, for a review) or the
HALOFIT (Smith et al. 2003) non linear power spectrum.



1975; van den Bos 2007), after the identification of the
curvature of the entropy surface with the generalised inverse
of the covariance matrix. In particular, the maximal entropy
distributions are precisely those for which the Cramér-Rao
inequality is an equality, since the curvature of the entropy
surface is the inverse correlation matrix between the model
predictions. Equation (33) also bears a strong formal
similarity to the well known result (Kullback 1959, chap. 2;
Caticha 2008, chap. 6) that the Fisher information can
always be written as the curvature of the Kullback-Leibler
divergence for distributions parametrised by the same set
of parameters.

The Fisher matrices currently used in weak lensing or 
clustering can all be seen as special cases of this approach, 
namely equation (|53p . when knowledge of the statistical 
properties of the future data does not go beyond the 
two-point statistics. Indeed, in the case that the model does 
not predict the means, and knowing that for discrete fields 
the spectral matrices, equation (|47|l , carry a noise term due 
to the finite number of galaxies, or, in the case of weak 
lensing, also due to the intrinsic ellipticities of galaxies, 
the amount of information in (|53p is essentially identical to 
the standard expressions used to predict the accuracy with 
which parameters will be extracted from power-spectra 
analysis. 

There is, however, a conceptual difference worth not- 
ing in that the standard approach is to pick an estimator 
for the power-spectra and assume that both the fields as 
well as the distribution of the estimators are Gaussian. The 
result is the amount of Fisher information there is in the 
power spectra, under the assumptions of Gaussian statistics 
for the estimators and the fields. In our approach, the only 
assumption is on the fields distribution. Our results do not 
depend on the way the information will be extracted, but 
shows the amount of Fisher information in the fields as a 
whole. 

Of course, the maximal entropy approach, which tries 
to capture the relevant properties of px through a sophis- 
ticated guess, gives no guaranties that its predictions are 
actually correct. Nevertheless, as discussed in section [2^ it 
provides a systematic approach with which to update the 
probability density function in case of improved knowledge 
of the relevant physics. 



5 SUMMARY AND DISCUSSION 

We have shown how Jaynes' Maximal Entropy Principle 
allows us to construct the Fisher information content on 
model parameters in a given data set in the form of equation 
(|32p or H33p . This is done by making the key quantity the 
entropy of the distribution as function of the constraints 
that we put on it. These constraints form our knowledge 
of the statistical properties of the future data. To the best 
of the authors knowledge, equation p2p or (|33p are not 
to be found in this form in the literature. However, they 
cannot be considered new, since as stated earlier, they can 
be easily gained from the Fisher infor mation content of 
the exponential family of distributions (jjennrich fc Moord 



Using this formalism we have investigated the com- 
bined Fisher information content of weak lensing probes up 
to second order in the shapes distortions, assuming model 
parameter independent noise. By having a look at the joint 
Shannon entropy of the fields, we have shown how the only 
effect of treating these observables jointly is to reduce the 
effective level of noise contaminating the convergence field, 
according to equations (|71|l and ((73}, independently of the 
model parameters. 

These are the key points of this paper that we
would like to emphasize:

(i) Equation (33) presents a measure of information content
that depends only on the constraints put on the data
and the physical model. It is written in terms of the
curvature of Shannon's entropy surface for maximal entropy
distributions. It can always be interpreted, regardless of the
actual distribution of the parameters and of the specific way
the analysis will proceed, as the expected curvature of the
$\chi^2$ surface for a fit to the full set of model predictions.
Assumptions of Gaussianity are neither needed nor used at any point.

(ii) Over a very wide range of scales, the probe of choice,
both for mass reconstruction and for cosmological purposes, is
the pair of ellipticity components of the galaxies. Flexion takes over
only on the arcsecond scale. In combination with the ellipticities,
it can lead to a substantial increase in statistical power
on the scale of the arcminute. From the cosmological point
of view, we expect size information to contribute at the 10%
level of the total information content. The only key parameter
is the combination (89) of the dispersion values and the
permeability $\alpha_s$ of the population sizes to the convergence
field. On the other hand, the prospects of including flexion
in cosmological analyses are less clear, the most obvious
drawback being the need for an accurate understanding of
the non-linear power spectrum.

(iii) Finally, our results make the inclusion of flexion
and size information within more detailed Fisher matrix
analyses for future dark energy experiments extremely simple,
such as the exhaustive approach of Bernstein (2009), which
combines the information from galaxy density fields with shear
fields in a tomographic setting. It follows from (78)
that the inclusion of all the two-point correlations
of these additional weak lensing probes can be accounted for
by adapting the noise term $N_{\rm eff}$.

Possible developments of this work include the relaxation of
the main limitations of the results, for instance the assumption
that the noise is independent of the model parameters.
We also plan to show that the approach presented in the
first part of this work can lead to quantitative evaluations
in non-Gaussian cases as well, when observables other than
the first two moments are considered.

APPENDIX A: CRAMÉR-RAO INEQUALITY

In this section, we provide a unified derivation of the
Cramér-Rao inequality in the multidimensional case (based
on Rao 1973) and of its relation to maximal entropy
distributions. We denote the vector of model parameters, of
dimension $n$, by

\[ \boldsymbol{\alpha} = (\alpha_1, \cdots, \alpha_n), \tag{A1} \]

and the vector of functions of dimension $m$, the estimators, by

\[ \hat{\mathbf{D}} = (\hat{D}_1, \cdots, \hat{D}_m), \tag{A2} \]

with expectation values $D_i(\boldsymbol{\alpha}) = \langle \hat{D}_i(x) \rangle$. In the following,
we rely on Gram matrices, whose elements are defined by
scalar products. Namely, for a set of vectors $y_i$, the Gram
matrix $Y$ generated by this set of vectors is defined as

\[ Y_{ij} = y_i \cdot y_j. \tag{A3} \]

Gram matrices are positive semi-definite and have the same rank
as the set of vectors that generates them. In particular, if the
vectors are linearly independent, the Gram matrix is strictly
positive definite and invertible.

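We adopt a vectorial notation for functions, writing scalar
products between vectors as

\[ f \cdot g \equiv \int dx\, p_X(x, \boldsymbol{\alpha})\, f(x)\, g(x), \tag{A4} \]

with $p_X(x, \boldsymbol{\alpha})$ being the probability density function of the
variable $X$ of interest. In this notation, both the Fisher
information matrix and the covariance matrix are seen to be Gram
matrices. Namely, the Fisher information matrix reads

\[ F_{\alpha_i \alpha_j} = f_{\alpha_i} \cdot f_{\alpha_j}, \qquad f_{\alpha_i}(x) = \frac{\partial \ln p_X(x, \boldsymbol{\alpha})}{\partial \alpha_i}, \tag{A5} \]

while the covariance matrix of the estimators is

\[ C_{ij} = g_i \cdot g_j, \qquad g_i(x, \boldsymbol{\alpha}) = \hat{D}_i(x) - D_i(\boldsymbol{\alpha}). \tag{A6} \]

For simplicity, and since it is sufficiently generic for our
purpose, we will assume that both sets of vectors $f$ and $g$ are
linearly independent, so that both matrices can be inverted.
Note that we also have

\[ \frac{\partial D_i}{\partial \alpha_j} = \int dx\, p_X(x, \boldsymbol{\alpha})\, \hat{D}_i(x)\, \frac{\partial \ln p_X(x, \boldsymbol{\alpha})}{\partial \alpha_j} = g_i \cdot f_{\alpha_j}. \tag{A7} \]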



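The Gram matrix $G$ of dimension $(m + n) \times (m + n)$ generated
by the set of vectors $(g_1, \cdots, g_m, f_{\alpha_1}, \cdots, f_{\alpha_n})$ takes
the form

\[ G = \begin{pmatrix} C & A \\ A^T & F \end{pmatrix}, \qquad A_{i \alpha_j} = g_i \cdot f_{\alpha_j}, \tag{A8} \]

and is also positive semi-definite by its very definition. It is
congruent to the matrix

\[ Y G Y^T = \begin{pmatrix} C - A F^{-1} A^T & 0 \\ 0 & F \end{pmatrix}, \tag{A9} \]

with

\[ Y = \begin{pmatrix} 1_{m \times m} & -A F^{-1} \\ 0 & 1_{n \times n} \end{pmatrix}. \tag{A10} \]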



Since two congruent matrices have the same number of
positive, zero and negative eigenvalues, and since
both $F$ and $G$ are positive, we can conclude that

\[ C \ge A F^{-1} A^T, \tag{A11} \]

which is the Cramér-Rao inequality. The lower bound on
the amount of information is seen from the fact that any
matrix written in block form is congruent to the matrix with
its blocks permuted,

\[ \begin{pmatrix} C & A \\ A^T & F \end{pmatrix} \cong \begin{pmatrix} F & A^T \\ A & C \end{pmatrix}, \tag{A12} \]

and using the same congruence argument leads to the lower
bound on information

\[ F \ge A^T C^{-1} A. \tag{A13} \]
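
As a numerical illustration of (A5)-(A13), the following sketch builds the
Gram matrices for a toy distribution on a finite grid and checks both bounds,
together with relation (A7). The family `pdf`, the grid, the choice of
estimators $x$ and $x^2$, and all function names are assumptions made for this
example only; the scores are obtained by central finite differences.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 41)   # finite sample space standing in for X


def pdf(alpha):
    """Toy two-parameter family (location a, width b); not an exponential family."""
    a, b = alpha
    weights = 1.0 / (1.0 + ((x - a) / b) ** 2)
    return weights / weights.sum()


def gram(u, v, p):
    """Scalar products u_i . v_j = sum_x p(x) u_i(x) v_j(x), discrete analogue of (A4)."""
    return (u * p) @ v.T


alpha = np.array([0.2, 0.3])
p = pdf(alpha)
h = 1e-5
steps = h * np.eye(2)

# Scores f_j = d ln p / d alpha_j, by central finite differences.
f = np.array([(np.log(pdf(alpha + s)) - np.log(pdf(alpha - s))) / (2 * h)
              for s in steps])

D_hat = np.array([x, x**2])              # estimator functions, m = 2
g = D_hat - (D_hat @ p)[:, None]         # centred estimators, cf. (A6)

F = gram(f, f, p)                        # Fisher information matrix (A5)
C = gram(g, g, p)                        # covariance matrix (A6)
A = gram(g, f, p)                        # mixed block of (A8)

# (A7): A_ij should match the derivatives of the expectation values.
dD_dalpha = np.array([(D_hat @ pdf(alpha + s) - D_hat @ pdf(alpha - s)) / (2 * h)
                      for s in steps]).T

cov_gap = C - A @ np.linalg.inv(F) @ A.T     # (A11): no negative eigenvalue
info_gap = F - A.T @ np.linalg.inv(C) @ A    # (A13): no negative eigenvalue
print(np.linalg.eigvalsh(cov_gap), np.linalg.eigvalsh(info_gap))
```

Both differences are Schur complements of the same Gram matrix, which is why
their eigenvalues cannot be negative whatever the toy distribution is.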
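Assume now that we have a probability density function
such that this inequality is in fact an equality, i.e.

\[ F = A^T C^{-1} A. \tag{A14} \]

By the above argument, the Gram matrix generated by

\[ (f_{\alpha_1}, \cdots, f_{\alpha_n}, g_1, \cdots, g_m) \tag{A15} \]

is congruent to the matrix

\[ \begin{pmatrix} 0_{n \times n} & 0 \\ 0 & C \end{pmatrix} \tag{A16} \]

and has rank $m$. By assumption, the covariance matrix is
invertible, so that the set $(g_1, \cdots, g_m)$ alone has rank $m$.
This implies that each of the $f$ vectors can be written as a linear
combination of the $g$ vectors,

\[ f_{\alpha_i} = \sum_{j=1}^{m} A_j\, g_j, \tag{A17} \]

or, more explicitly,

\[ \frac{\partial \ln p_X(x, \boldsymbol{\alpha})}{\partial \alpha_i} = \sum_{j=1}^{m} A_j(\boldsymbol{\alpha}) \left[ \hat{D}_j(x) - D_j(\boldsymbol{\alpha}) \right], \tag{A18} \]

where the key point is that the coefficients $A_j$ are
independent of $x$. Integrating this equation, we obtain

\[ \ln p_X(x, \boldsymbol{\alpha}) = -\sum_{i=1}^{m} \lambda_i(\boldsymbol{\alpha})\, \hat{D}_i(x) - \ln Z(\boldsymbol{\alpha}) + \ln q_X(x) \tag{A19} \]

for some functions $\lambda$ and $Z$ of the model parameters only,
and a function $q_X$ of $x$ only. We thus obtain

\[ p_X(x, \boldsymbol{\alpha}) = \frac{q_X(x)}{Z(\boldsymbol{\alpha})} \exp\left[ -\sum_{i=1}^{m} \lambda_i(\boldsymbol{\alpha})\, \hat{D}_i(x) \right]. \tag{A20} \]

This is precisely the distribution that we obtain by
maximising the entropy relative to $q_X(x)$, while satisfying the
constraints

\[ D_i(\boldsymbol{\alpha}) = \langle \hat{D}_i(x) \rangle, \qquad i = 1, \cdots, m. \tag{A21} \]

Taking $q_X$ as the uniform distribution makes it identical
with the formula in equation (24).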
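
The saturation of the bound by distributions of the form (A20) can be checked
on a small example. The sketch below takes a uniform $q_X$ on a finite grid, two
estimator functions and a single model parameter; the map `potentials` from the
parameter to the multipliers $\lambda_i$ is an arbitrary smooth choice made for
this illustration, and the score is again obtained by finite differences.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 41)
D_hat = np.array([x, x**2])               # estimator functions, m = 2


def potentials(alpha):
    """Hypothetical smooth dependence of the multipliers lambda_i on alpha."""
    return np.array([np.sin(alpha), 1.0 + alpha**2])


def pdf(alpha):
    """Maximal entropy form (A20) with a uniform q_X on the grid."""
    weights = np.exp(-potentials(alpha) @ D_hat)
    return weights / weights.sum()


alpha, h = 0.4, 1e-5
p = pdf(alpha)

# Score d ln p / d alpha by central finite differences (one parameter, n = 1).
score = (np.log(pdf(alpha + h)) - np.log(pdf(alpha - h))) / (2 * h)
f = score[None, :]
g = D_hat - (D_hat @ p)[:, None]          # centred estimators, cf. (A6)

F = (f * p) @ f.T                         # Fisher information, 1 x 1
C = (g * p) @ g.T                         # covariance of the estimators, 2 x 2
A = (g * p) @ f.T                         # mixed block, 2 x 1

saturated = A.T @ np.linalg.inv(C) @ A    # right-hand side of (A14)
print(F, saturated)                       # the two agree up to rounding
```

Here the score is a linear combination of the centred estimators, as in (A17),
so the two sides of (A14) coincide.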



APPENDIX B: POINT FIELDS

The data consist of a set of numbers, one at each position where
a galaxy sits and a measurement was made. We use the handy
notation in terms of the Dirac delta function,

\[ \phi(\mathbf{x}) = \sum_i \epsilon_i\, \delta^D(\mathbf{x} - \mathbf{x}_i), \tag{B1} \]

where the sum runs over the positions $\mathbf{x}_i$ for which $\epsilon$ is
measured. To obtain the spectral matrices, we need the Fourier
transform of the field, which reads in our case

\[ \tilde{\phi}(\mathbf{l}) = \sum_i \epsilon_i \exp(-i\, \mathbf{l} \cdot \mathbf{x}_i). \tag{B2} \]

In this work, we assume that the set of points shows
negligible clustering, so that the probability density function for
the joint occurrence of a particular set of galaxy positions is
uniform.
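
A small Monte Carlo sketch may help to see where the noise term of the
spectral matrices comes from. It evaluates (B2) for unclustered points carrying
pure noise values, under assumptions that are ours: positions uniform in a unit
square, independent zero-mean Gaussian values with dispersion `sigma`, and a
handful of arbitrary wave vectors. In that case the mean of
$|\tilde{\phi}(\mathbf{l})|^2$ is the number of points times `sigma**2`,
independently of $\mathbf{l}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gal, n_real, sigma = 200, 2000, 0.3

# Unclustered positions in the unit square and pure-noise values eps_i.
positions = rng.uniform(0.0, 1.0, size=(n_real, n_gal, 2))
eps = rng.normal(0.0, sigma, size=(n_real, n_gal))


def field_ft(eps, positions, ell):
    """sum_i eps_i exp(-i l . x_i) for every realisation, cf. (B2)."""
    phase = positions @ ell
    return np.sum(eps * np.exp(-1j * phase), axis=-1)


# A few arbitrary wave vectors (illustrative choice).
ells = 2.0 * np.pi * np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 5.0]])

power = np.array([np.mean(np.abs(field_ft(eps, positions, ell)) ** 2)
                  for ell in ells])
print(power / (n_gal * sigma**2))  # each entry scatters around 1
```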

In the following, we decompose the wave vector $\mathbf{l}$ on the flat
sky in terms of its modulus and polar angle as

\[ \mathbf{l} = l \begin{pmatrix} \cos\varphi_l \\ \sin\varphi_l \end{pmatrix}. \tag{B3} \]

B1 Ellipticities

When the two ellipticity components are measured, we have
two such fields $\phi_1, \phi_2$ at our disposal. For instance, the field
describing the first component becomes

\[ \tilde{\phi}_1(\mathbf{l}) = \sum_i \epsilon^1_i \exp(-i\, \mathbf{l} \cdot \mathbf{x}_i). \tag{B4} \]

We assume that the measured ellipticities trace the shear
fields, in the sense that the measured components are built
out of the shear at that position plus some value unrelated
to it,

\[ \epsilon^1_i = \gamma_1(\mathbf{x}_i) + \epsilon^1_{{\rm int}, i}, \qquad \epsilon^2_i = \gamma_2(\mathbf{x}_i) + \epsilon^2_{{\rm int}, i}. \tag{B5} \]

The vector $\mathbf{v}$ relating the spectral matrices of the ellipticities
and the convergence is then obtained by inserting (B4), with
the above relations (B5), in its definition (69), and using the
relation between shears and convergence in equation (82).
The result is

\[ \mathbf{v} = n_\gamma \begin{pmatrix} \cos 2\varphi_l \\ \sin 2\varphi_l \end{pmatrix}, \tag{B6} \]

where $n_\gamma$ is the number density of galaxies for which
ellipticity measurements are available. Under our assumptions of
uncorrelated intrinsic ellipticities, with dispersions of equal
magnitude $\sigma_\gamma^2$ for the two components, the noise matrix $N$
becomes

\[ N = \begin{pmatrix} n_\gamma \sigma_\gamma^2 & 0 \\ 0 & n_\gamma \sigma_\gamma^2 \end{pmatrix}. \tag{B7} \]

The effective noise, given in equation (71), is readily
computed:

\[ N_{\rm eff} = \frac{\sigma_\gamma^2}{n_\gamma}. \tag{B8} \]



B2 Sizes

As noted in the main text, the apparent sizes of galaxies are
modified by lensing in the following way,

\[ s = s_{\rm int}\, (1 + \alpha_s \kappa), \tag{B9} \]

for some coefficient $\alpha_s$ which is unity in pure weak lensing
theory. Denoting the number of galaxies for which size
measurements are available by $n_s$, and the mean intrinsic size
of the sample by $s_{\rm int}$, the spectrum of the size field reduces,
under the assumption of uncorrelated intrinsic sizes, to

\[ C^{s}(l) = n_s^2\, s_{\rm int}^2\, \alpha_s^2\, C^{\kappa}(l) + n_s \sigma_s^2. \tag{B10} \]

The vector $\mathbf{v}$ and matrix $N$ are now numbers, which are read
off from the above equation to be

\[ v = n_s\, s_{\rm int}\, \alpha_s, \qquad N = n_s \sigma_s^2, \tag{B11} \]

leading to the effective noise

\[ N_{\rm eff} = \frac{1}{n_s} \left( \frac{\sigma_s}{\alpha_s\, s_{\rm int}} \right)^2. \tag{B12} \]






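B3 Second order, flexion

Denoting with $n_{\mathcal{F}}$ and $n_{\mathcal{G}}$ the number of galaxies for which
$\mathcal{F}$ and $\mathcal{G}$ are measured, the vectors linking the flexion to
the convergence are

\[ \mathbf{v}_{\mathcal{F}} = -i\, l\, n_{\mathcal{F}} \begin{pmatrix} \cos\varphi_l \\ \sin\varphi_l \end{pmatrix} \tag{B13} \]

and

\[ \mathbf{v}_{\mathcal{G}} = -i\, l\, n_{\mathcal{G}} \begin{pmatrix} \cos 3\varphi_l \\ \sin 3\varphi_l \end{pmatrix}. \tag{B14} \]

Using again the assumption of uncorrelated intrinsic
components, we have the four-dimensional diagonal noise matrix

\[ N = \begin{pmatrix} n_{\mathcal{F}} \sigma_{\mathcal{F}}^2 \cdot 1_{2 \times 2} & 0 \\ 0 & n_{\mathcal{G}} \sigma_{\mathcal{G}}^2 \cdot 1_{2 \times 2} \end{pmatrix}, \tag{B15} \]

leading to the effective noise, this time mode-dependent,

\[ \frac{1}{N_{\rm eff}} = l^2 \left( \frac{n_{\mathcal{F}}}{\sigma_{\mathcal{F}}^2} + \frac{n_{\mathcal{G}}}{\sigma_{\mathcal{G}}^2} \right). \tag{B16} \]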
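
The effective noise levels of this appendix can be reproduced with a few
lines of code. The sketch below assumes that equation (71) of the main text has
the form $1/N_{\rm eff} = \mathbf{v}^\dagger N^{-1} \mathbf{v}$, which is the form
that reproduces (B8), (B12) and (B16) from the vectors and noise matrices given
above. All numerical values are placeholders chosen for illustration, not the
survey parameters of this work, and the function names are hypothetical.

```python
import numpy as np


def effective_noise(v, noise):
    """N_eff = 1 / (v^dagger N^-1 v); assumed form of equation (71)."""
    v = np.atleast_1d(np.asarray(v, dtype=complex))
    noise = np.atleast_2d(np.asarray(noise, dtype=float))
    return 1.0 / np.real(np.conj(v) @ np.linalg.solve(noise, v))


def ellipticity_probe(n_gamma, sigma_gamma, phi_l):
    """Vector and noise matrix of (B6) and (B7)."""
    v = n_gamma * np.array([np.cos(2 * phi_l), np.sin(2 * phi_l)])
    return v, n_gamma * sigma_gamma**2 * np.eye(2)


def size_probe(n_s, sigma_s, s_int, alpha_s):
    """Vector and noise of (B11); both reduce to single numbers."""
    return np.array([n_s * s_int * alpha_s]), np.array([[n_s * sigma_s**2]])


def flexion_probe(n_F, sigma_F, n_G, sigma_G, ell, phi_l):
    """Stacked vectors (B13), (B14) and noise matrix (B15)."""
    v_F = -1j * ell * n_F * np.array([np.cos(phi_l), np.sin(phi_l)])
    v_G = -1j * ell * n_G * np.array([np.cos(3 * phi_l), np.sin(3 * phi_l)])
    noise = np.diag([n_F * sigma_F**2] * 2 + [n_G * sigma_G**2] * 2)
    return np.concatenate([v_F, v_G]), noise


def combine(*probes):
    """Stack probes whose noise is mutually uncorrelated (block-diagonal N)."""
    v = np.concatenate([np.asarray(p[0], dtype=complex) for p in probes])
    noise = np.zeros((v.size, v.size))
    start = 0
    for _, block in probes:
        k = block.shape[0]
        noise[start:start + k, start:start + k] = block
        start += k
    return v, noise


# Placeholder values (illustrative only).
ell, phi_l, n_gal = 1.0e4, 0.7, 3.5e8
sigma_gamma, sigma_s, s_int, alpha_s = 0.3, 0.3, 1.0, 1.0
sigma_F = sigma_G = 8.0e3

shear = ellipticity_probe(n_gal, sigma_gamma, phi_l)
sizes = size_probe(n_gal, sigma_s, s_int, alpha_s)
flex = flexion_probe(n_gal, sigma_F, n_gal, sigma_G, ell, phi_l)

for name, probe in [("shear", shear), ("size", sizes), ("flexion", flex),
                    ("all", combine(shear, sizes, flex))]:
    print(name, effective_noise(*probe))
```

With uncorrelated noise between the probes, the inverse effective noise of the
combination is the sum of the individual inverse effective noises, so adding a
probe can only lower the noise level on the convergence.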